Spark or Hadoop: Is it an either-or proposition?
By Slim Baltagi (SlimBaltagi)
Big Data Practice Director
Advanced Analytics LLC
OR? XOR?
Los Angeles Spark Users Group
March 12, 2015
Your Presenter – Slim Baltagi
• Sr. Big Data Solutions Architect living in Chicago
• Over 17 years of IT and business experience
• Over 4 years of Big Data experience, working on over 12 Hadoop projects
• Speaker at Big Data events
• Creator and maintainer of the Apache Spark Knowledge Base, http://www.SparkBigData.com, with over 4,000 categorized Apache Spark web resources
@SlimBaltagi
https://www.linkedin.com/in/slimbaltagi
sbaltagi@gmail.com
Disclaimer: This is a vendor-independent talk that expresses my own opinions. I am not endorsing nor promoting any product or vendor mentioned in this talk.
Agenda
I. Motivation
II. Big Data, Typical Big Data Stack, Apache Hadoop, Apache Spark
III. Spark with Hadoop
IV. Spark without Hadoop
V. More Q&A
I. Motivation
1. News
2. Surveys
3. Vendors
4. Analysts
5. Key Takeaways
1. News
• Is it 'Spark vs. Hadoop', 'Spark OR Hadoop', or 'Spark AND Hadoop'?
• Apache Spark: Hadoop friend or foe?
• Apache Spark: killer or savior of Apache Hadoop?
• Apache Spark's Marriage To Hadoop Will Be Bigger Than Kim And Kanye
• Adios Hadoop, Hola Spark
• Apache Spark: Moving on from Hadoop
• Apache Spark Continues to Spread Beyond Hadoop
• Escape From Hadoop
• Spark promises to up-end Hadoop, but in a good way
2. Surveys
• "Hadoop's historic focus on batch processing of data was well supported by MapReduce, but there is an appetite for more flexible developer tools to support the larger market of mid-size datasets and use cases that call for real-time processing." 2015 Apache Spark Survey by Typesafe, January 27, 2015
http://www.marketwired.com/press-release/survey-indicates-apache-spark-gaining-developer-adoption-as-big-datas-projects-1986162.htm
• Apache Spark: Preparing for the Next Wave of Reactive Big Data, January 27, 2015, by Typesafe
http://typesafe.com/blog/apache-spark-preparing-for-the-next-wave-of-reactive-big-data
Apache Spark Survey 2015 by Typesafe – Quick Snapshot
3. Vendors
3. Vendors
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• Uniform API for diverse workloads over diverse storage systems and runtimes.
Source: Slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments."
Source: Slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor – no new project – is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013 http://vision.cloudera.com/mapreduce-spark/
• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014 https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014 http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening.
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015 http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
II. Big Data, Typical Big Data Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is therefore inadequate.
• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
2. Typical Big Data Stack
3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name: an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning) http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data. http://wiki.apache.org/hadoop/WordCount
• Pig http://pig.apache.org
• Hive http://hive.apache.org
• Scoobi: A Scala productivity framework for Hadoop https://github.com/NICTA/scoobi
• Cascading http://www.cascading.org
• Scalding: A Scala API for Cascading http://twitter.com/scalding
• Crunch http://crunch.apache.org
• Scrunch http://crunch.apache.org/scrunch.html
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
• 1st Generation: Batch
• 2nd Generation: Batch, Interactive
• 3rd Generation: Batch, Interactive, Near-Real-Time
• 4th Generation: Batch, Interactive, Real-Time, Iterative
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see slide on evolution of Programming APIs). User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."
Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink http://flink.apache.org offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/27-flink
Hadoop MapReduce vs. Tez vs. Spark

| Criteria            | Hadoop MapReduce                              | Tez                                 | Spark                                                               |
| License             | Open Source Apache 2.0, version 2.x           | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x                                 |
| Processing Model    | On-disk (disk-based parallelization); Batch   | On-disk; Batch, Interactive         | In-memory and on-disk; Batch, Interactive, Streaming (Near Real-Time) |
| Language written in | Java                                          | Java                                | Scala                                                               |
| API                 | [Java, Python, Scala], user-facing            | Java [ISV/Engine/Tool builder]      | [Scala, Java, Python], user-facing                                  |
| Libraries           | None; separate tools                          | None                                | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]             |
Hadoop MapReduce vs. Tez vs. Spark

| Criteria         | Hadoop MapReduce                                                        | Tez                                                       | Spark                                                       |
| Installation     | Bound to Hadoop                                                         | Bound to Hadoop                                           | Isn't bound to Hadoop                                       |
| Ease of Use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode  |
| Compatibility    | Same to data types and data sources                                     | Same to data types and data sources                       | Same to data types and data sources                         |
| YARN integration | YARN application                                                        | Ground-up YARN application                                | Spark is moving towards YARN                                |
Hadoop MapReduce vs. Tez vs. Spark

| Criteria    | Hadoop MapReduce           | Tez                        | Spark                                                                          |
| Deployment  | YARN                       | YARN                       | [Standalone, YARN*, SIMR, Mesos, …]                                            |
| Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise |
| Security    | More features and projects | More features and projects | Still in its infancy                                                           |

* Partial support
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
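For illustration, here is a minimal sketch (not from the original deck) of the canonical MapReduce word count example rewritten against the Spark Scala API; the input and output paths are hypothetical:

```scala
// Sketch: word count, the classic MapReduce example, as a single Spark job.
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile("hdfs:///data/input.txt")   // hypothetical input path
      .flatMap(_.toLowerCase.split("\\s+"))              // the "map" side: emit words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                                // the "reduce" side: sum counts
    counts.saveAsTextFile("hdfs:///data/wordcounts")     // hypothetical output path
    sc.stop()
  }
}
```

The map and reduce logic survives intact as the lambdas passed to flatMap/reduceByKey, which is why mapper and reducer functions can often be reused directly.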
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open) https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/19
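As a sketch, migrating an existing script needs no code change, only the execution-mode flag (the script name below is hypothetical):

```
# Run an unmodified Pig script on the Spark engine instead of MapReduce
pig -x spark wordcount.pig

# Or start an interactive Grunt shell backed by Spark
pig -x spark
```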
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015 https://issues.apache.org/jira/browse/HIVE-7292
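As a sketch of what the migration looks like in practice (the table and query are hypothetical), switching an existing Hive session to the Spark engine is a one-line session setting:

```
hive> SET hive.execution.engine=spark;  -- instead of 'mr' or 'tez'
hive> SELECT word, COUNT(*) AS freq
    > FROM docs                         -- hypothetical existing Hive table
    > GROUP BY word;                    -- now executed as Spark jobs
```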
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast… or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015 http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/12
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
Cascading (expected in the 3.1 release)
• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release.
Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014
http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014
http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink http://mahout.apache.org/users/basics/algorithms.html
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
3. Integration
[Table: Hadoop ecosystem services and corresponding open source tools – Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory) http://hortonworks.com/blog/ddm/ to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
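As a sketch of the first bullet, the same read/write calls work against any Hadoop-supported storage system; only the URI scheme changes. All paths below are hypothetical:

```scala
// Sketch: Spark goes through the Hadoop filesystem API, so switching storage
// systems is mostly a matter of changing the URI scheme.
import org.apache.spark.{SparkConf, SparkContext}

object StorageExamples {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StorageExamples"))
    val local = sc.textFile("file:///tmp/input.txt")         // local file system
    val hdfs  = sc.textFile("hdfs://namenode:8020/data/in")  // HDFS
    val s3    = sc.textFile("s3n://my-bucket/logs/")         // Amazon S3 (2015-era s3n scheme)
    println(local.count() + hdfs.count() + s3.count())
    local.saveAsTextFile("hdfs://namenode:8020/data/out")    // write back to HDFS
    sc.stop()
  }
}
```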
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
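A minimal sketch in the spirit of the bundled HBaseTest.scala example, counting the rows of an HBase table through the generic Hadoop InputFormat support (the table name is hypothetical):

```scala
// Sketch: read an HBase table as an RDD via newAPIHadoopRDD.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseCount"))
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table")   // hypothetical table name
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(s"Rows in table: ${rdd.count()}")
    sc.stop()
  }
}
```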
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/20-cassandra
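A sketch of the DataStax connector's read and write paths (the keyspace, table and column names are hypothetical):

```scala
// Sketch: Cassandra tables as RDDs via the Spark Cassandra Connector (1.x-era API).
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CassandraExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraExample")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)
    // Expose a Cassandra table as an RDD ...
    val words = sc.cassandraTable("test_ks", "words")     // hypothetical keyspace/table
    println(words.count())
    // ... and write an RDD back to another table.
    sc.parallelize(Seq(("spark", 1), ("hadoop", 2)))
      .saveToCassandra("test_ks", "counts", SomeColumns("word", "count"))
    sc.stop()
  }
}
```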
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandra. http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
3. Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015
http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015
http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014
http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration is still improving. https://issues.apache.org/jira/issues/ (JQL search: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)
• Some issues are critical ones.
• Running Spark on YARN http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
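A sketch of the three bullets above using the Spark 1.2-era HiveContext (the table name is hypothetical; a hive-site.xml must be on the classpath):

```scala
// Sketch: query an existing Hive table from Spark SQL.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveExample"))
    val hiveCtx = new HiveContext(sc)
    // Import relational data from a Hive table and run SQL over it.
    val rows = hiveCtx.sql("SELECT key, value FROM src LIMIT 10")  // hypothetical table 'src'
    rows.collect().foreach(println)
    sc.stop()
  }
}
```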
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark. Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs. Use BI tools to query in-memory data in Spark. Embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/24-kafka
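A sketch of the receiver-based integration from the guide above (Spark 1.2-era API; the ZooKeeper quorum, consumer group and topic name are hypothetical):

```scala
// Sketch: consume a Kafka topic with Spark Streaming and count words per batch.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaWordCount"), Seconds(2))
    // (zkQuorum, consumer group, topic -> number of receiver threads)
    val lines = KafkaUtils.createStream(
      ssc, "zk-host:2181", "my-consumer-group", Map("events" -> 1)).map(_._2)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```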
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide https://spark.apache.org/docs/latest/streaming-flume-integration.html
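A sketch of Approach 1 (push-based), where Flume pushes Avro events to a Spark Streaming receiver listening on a host and port (the values here are hypothetical):

```scala
// Sketch: count Flume events arriving in each 10-second batch.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeEventCount {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("FlumeEventCount"), Seconds(10))
    // Flume's Avro sink must be configured to point at this host:port.
    val stream = FlumeUtils.createStream(ssc, "localhost", 9999)
    stream.count().map(c => s"Received $c Flume events.").print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```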
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
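A sketch of schema inference over JSON with the Spark 1.2-era API (the input path and field names are hypothetical):

```scala
// Sketch: load JSON, let Spark SQL infer the schema, then query it with SQL.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonExample"))
    val sqlCtx = new SQLContext(sc)
    val people = sqlCtx.jsonFile("data/people.json")  // schema inferred automatically
    people.printSchema()                              // no DDL was needed
    people.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
    sc.stop()
  }
}
```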
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL http://www.infoobjects.com/spark-sql-parquet/
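A round-trip sketch with the Spark 1.2-era API (the case class, records and output path are hypothetical):

```scala
// Sketch: write an RDD out as Parquet, read it back, and query it.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object ParquetExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetExample"))
    val sqlCtx = new SQLContext(sc)
    import sqlCtx.createSchemaRDD  // implicit RDD -> SchemaRDD conversion (pre-1.3)
    val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
    people.saveAsParquetFile("people.parquet")         // write RDD out to Parquet
    val loaded = sqlCtx.parquetFile("people.parquet")  // read it back, schema preserved
    loaded.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age > 26").collect().foreach(println)
    sc.stop()
  }
}
```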
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo https://github.com/kite-sdk/kite-examples/tree/master/spark
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark. https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
Hadoop ecosystem + Spark ecosystem
4. Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014 http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41
4. Complementarity: YARN + Mesos
References:
• Apache Mesos vs. Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
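The Data >> RAM vs Data << RAM rule of thumb above can be sketched as a toy decision helper in plain Python. This is only an illustration of the heuristic from the slide; the function name and the 2x / 0.5x thresholds are hypothetical, not from any product:

```python
def pick_engine(data_gb, cluster_ram_gb):
    """Toy heuristic from the slide: prefer a disk/stream-oriented
    engine (Tez) when data far exceeds cluster RAM, and an in-memory
    engine (Spark) when data fits comfortably in memory.
    Thresholds are illustrative assumptions, not measured values."""
    if data_gb > 2 * cluster_ram_gb:      # Data >> RAM
        return "tez"
    if data_gb < 0.5 * cluster_ram_gb:    # Data << RAM
        return "spark"
    return "either"                       # grey zone: benchmark both

print(pick_engine(data_gb=10_000, cluster_ram_gb=512))  # huge data volume
print(pick_engine(data_gb=100, cluster_ram_gb=512))     # fits in memory
```

In the grey zone between the two cases, the slide's advice still applies: benchmark both engines on your own workload.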
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015: httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 Use OpenStack Swift (object store):
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
Being file-system agnostic, Spark, coupled with its analytics capabilities, can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local: httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
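In practice, the deployment options above differ mainly in the master URL handed to Spark (e.g. via spark-submit). A small plain-Python lookup summarizes the common forms; the hostnames and ports are placeholders, and the YARN form shown is the Spark 1.x-era syntax:

```python
# Common master URL forms for Spark (hostnames/ports are placeholders).
MASTER_URLS = {
    "local":      "local[*]",             # all cores on one machine
    "standalone": "spark://master:7077",  # Spark's own cluster manager
    "mesos":      "mesos://master:5050",  # Apache Mesos
    "yarn":       "yarn-cluster",         # Hadoop YARN (Spark 1.x syntax)
}

def submit_command(mode, app="app.py"):
    """Build an illustrative spark-submit command line (hypothetical app name)."""
    return f"spark-submit --master {MASTER_URLS[mode]} {app}"

print(submit_command("mesos"))
```

The same application code runs under any of these masters, which is the point of the slide: only the cluster-manager plumbing changes.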
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine: httpstratiogithubiostreaming-cep-engine
• 'Stratio' tag at SparkBigDatacom: httpsparkbigdatacomcomponenttagstag40
82
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigDatacom: httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' tag at SparkBigDatacom: httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos, September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigDatacom: httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Hadoop Ecosystem        Spark Ecosystem
Components:
  HDFS                  Tachyon
  YARN                  Mesos
Tools:
  Pig                   Spark native API
  Hive                  Spark SQL
  Mahout                MLlib
  Storm                 Spark Streaming
  Giraph                GraphX
  HUE                   Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware
88
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigDatacom: httpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs Mesos
Criteria          YARN                     Mesos
Resource sharing  Yes                      Yes
Written in        Java                     C++
Scheduling        Memory only              CPU and memory
Running tasks     Unix processes           Linux container groups
Requests          Specific requests and    More generic, but more coding
                  locality preference      for writing frameworks
Maturity          Less mature              Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigDatacom: httpsparkbigdatacomcomponenttagstag11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
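To illustrate the "mix and match SQL with imperative code" idea without a Spark cluster, here is a plain-Python analogue using the standard-library sqlite3 module as a stand-in for Spark SQL. The table name and data are made up; the pattern (a declarative SQL step feeding an imperative post-processing step) is the point, not the API:

```python
import sqlite3

# Stand-in for a table registered with a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 30.0), ("bob", 20.0), ("alice", 50.0)])

# Declarative step: aggregate with SQL...
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()

# ...then an imperative step: post-process the result in ordinary code,
# the way Spark SQL results flow into further RDD/DataFrame operations.
top = [user for user, total in rows if total > 40.0]
print(top)
```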
Spark MLlib
93
'Spark MLlib' tag at SparkBigDatacom: httpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigDatacom: httpsparkbigdatacomcomponenttagstag3-spark-streaming
Storm vs Spark Streaming
Criteria              Storm                  Spark Streaming
Processing model      Record at a time       Mini batches
Latency               Sub-second             Few seconds
Fault tolerance       At least once          Exactly once
(every record         (may be duplicates)
processed)
Batch framework       Not available          Core Spark API
integration
Supported languages   Any programming        Scala, Java, Python
                      language
95
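The "record at a time" vs "mini batches" row above can be illustrated with a toy simulation in plain Python (no Storm or Spark involved; function names are made up): the same stream is handed to a processing function either once per record or in small batches.

```python
def record_at_a_time(stream, process):
    """Storm-style: invoke processing once per record (lower latency)."""
    calls = []
    for record in stream:
        calls.append(process([record]))
    return calls

def mini_batches(stream, process, batch_size):
    """Spark-Streaming-style: group records into small batches
    (higher per-record latency, but batch-friendly semantics)."""
    calls = []
    for i in range(0, len(stream), batch_size):
        calls.append(process(stream[i:i + batch_size]))
    return calls

stream = [1, 2, 3, 4, 5]
count = len  # trivial 'processing': count records per invocation

print(record_at_a_time(stream, count))  # five invocations of one record each
print(mini_batches(stream, count, 2))   # three invocations of up to two records
```

The batch-sized invocations are also why Spark Streaming integrates naturally with the core batch API, as the table's "Batch framework integration" row notes.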
GraphX
96
'GraphX' tag at SparkBigDatacom: httpsparkbigdatacomcomponenttagstag6-graphx
Notebook
97
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark shell backend for IPython: httpsgithubcomtribbloidISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring Your Own Storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
Apache Spark Survey 2015 by
Typesafe - Quick Snapshot
7
3 Vendors
8
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: httpsdatabrickscomblog20140121spark-and-hadoophtml
• "Uniform API for diverse workloads over diverse storage systems and runtimes." Source: slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: httpwwwslidesharenetdatabricksspark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: httpwwwslidesharenetdatabricksnew-directions-for-apache-spark-in-2015
3 Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: httpvisionclouderacommapreduce-spark
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." httpwwwclouderacomcontentclouderaenproducts-and-servicescdhsparkhtml
3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." httpswwwmaprcomproductsapache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: httpswwwmaprcomcompanypress-releasesmapr-adds-complete-apache-spark-stack-its-distribution-hadoop
10
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." httphortonworkscomhadoopspark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: httpsdatabrickscomblog20141031hortonworks-a-shared-vision-for-apache-spark-on-hadoophtml
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30, 2014: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
11
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: httpblogsgartnercommerv-adrian20150225hadoop-questions-from-recent-webinar-span-spectrum
12
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014: httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
13
5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly.
2 Surveys: listen to what Spark developers are saying.
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors; claims need to be contextualized.
4 Analysts: a thorough understanding of the market dynamics.
14
II Big Data, Typical Big Data Stack, Apache Hadoop, Apache Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1 Big Data
• Big Data is still one of the most inflated buzzwords of the last years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: httpenwikipediaorgwikiBig_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand." Amir H Payberah, Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: httpbigdataandreamostosiname. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: httpswwwyoutubecomwatchv=1KvTZZAkHy0
18
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: httpsparkbigdatacomcomponenttagstag11-core-spark
• Spark Streaming: httpsparkbigdatacomcomponenttagstag3-spark-streaming
• Spark SQL: httpsparkbigdatacomcomponenttagstag4-spark-sql
• MLlib (Machine Learning): httpsparkbigdatacomcomponenttagstag5-mllib
• GraphX: httpsparkbigdatacomcomponenttagstag6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list; stay tuned.
19
5 Key Takeaways
1 Big Data: still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: emergence of the Apache Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: httpwikiapacheorghadoopWordCount
• Pig: httppigapacheorg
• Hive: httphiveapacheorg
• Scoobi, a Scala productivity framework for Hadoop: httpsgithubcomNICTAscoobi
• Cascading: httpwwwcascadingorg
• Scalding, a Scala API for Cascading: httptwittercomscalding
• Crunch: httpcrunchapacheorg
• Scrunch: httpcrunchapacheorgscrunchhtml
22
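The "assembly code" point is easiest to see on the canonical word count. Below is a plain-Python sketch (not Java, and not actual Hadoop or Spark API) of the explicit map/shuffle/reduce phases a MapReduce job spells out, next to the one-expression equivalent that higher-level APIs let you write:

```python
from collections import defaultdict, Counter

lines = ["to be or not to be"]

# MapReduce style: explicit map, shuffle, and reduce phases.
def wordcount_mr(lines):
    mapped = [(w, 1) for line in lines for w in line.split()]    # map
    groups = defaultdict(list)
    for key, value in mapped:                                    # shuffle
        groups[key].append(value)
    return {key: sum(values) for key, values in groups.items()}  # reduce

# High-level style: the same job as one expression.
def wordcount_concise(lines):
    return dict(Counter(w for line in lines for w in line.split()))

assert wordcount_mr(lines) == wordcount_concise(lines)
print(wordcount_mr(lines))
```

The tools listed above (Pig, Hive, Scalding, Crunch, and later the Spark API) all exist to let you write the second form instead of the first.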
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
23
1st Generation (MapReduce): Batch
2nd Generation (Tez): Batch, Interactive
3rd Generation (Spark): Batch, Interactive, Near-Real Time
4th Generation (Flink): Batch, Interactive, Real-Time, Iterative
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." httphadoopapacheorg
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
24
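The "most problems don't fit a single MR job" point can be sketched in plain Python: a task like "top word by count" needs a second job after word count in classic MapReduce, whereas a pipeline API simply chains one more transformation. The code below only illustrates the shape of the problem, not any real Hadoop API:

```python
from collections import Counter

lines = ["to be or not to be", "be quick"]

# Job 1: word count (one full MapReduce job on Hadoop).
counts = Counter(w for line in lines for w in line.split())

# Job 2: find the most frequent word. In classic MapReduce this means
# writing job 1's output to HDFS and launching a second job to scan it;
# in a pipeline API it is just one more chained transformation.
top_word, top_count = max(counts.items(), key=lambda kv: kv[1])

print(top_word, top_count)
```

Every extra MR job adds a full write-to-disk / read-from-disk round trip, which is exactly the overhead the later engines (Tez, Spark, Flink) set out to remove.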
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: httptezapacheorg
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
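A minimal sketch of the RDD idea in plain Python (a toy class, not Spark's API): transformations are recorded lazily and only executed when an action is called, and a materialized result can be cached in memory and reused:

```python
class ToyRDD:
    """Toy illustration of lazy transformations plus in-memory caching."""
    def __init__(self, data, ops=None):
        self.data = list(data)
        self.ops = ops or []     # recorded transformations, not yet run
        self.cached = None

    def map(self, f):            # lazy: just records the step
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, p):         # lazy as well
        return ToyRDD(self.data, self.ops + [("filter", p)])

    def collect(self):           # action: runs the whole pipeline
        if self.cached is not None:
            return self.cached
        out = self.data
        for kind, fn in self.ops:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

    def cache(self):             # keep the materialized result around
        self.cached = self.collect()
        return self

rdd = ToyRDD(range(5)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())
```

Real RDDs add partitioning, fault tolerance via lineage, and distributed execution; the lazy-pipeline-plus-cache shape is the part this sketch shows.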
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (httpflinkapacheorg) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigDatacom: httpsparkbigdatacomcomponenttagstag27-flink
27
Hadoop MapReduce vs Tez vs Spark
Criteria          MapReduce                Tez                     Spark
License           Open Source Apache 2.0,  Open Source Apache      Open Source Apache 2.0,
                  version 2.x              2.0, version 0.x        version 1.x
Processing model  On-disk (disk-based      On-disk; Batch,         In-memory, On-disk;
                  parallelization); Batch  Interactive             Batch, Interactive,
                                                                   Streaming (Near Real-Time)
Written in        Java                     Java                    Scala
API               [Java, Python, Scala],   Java [ISV/Engine/Tool   [Scala, Java, Python],
                  User-Facing              builder]                User-Facing
Libraries         None, separate tools     None                    [Spark Core, Spark Streaming,
                                                                   Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs Tez vs Spark
Criteria          MapReduce                Tez                     Spark
Installation      Bound to Hadoop          Bound to Hadoop         Isn't bound to Hadoop
Ease of use       Difficult to program,    Difficult to program;   Easy to program, no need
                  needs abstractions; no   no interactive mode     of abstractions;
                  interactive mode except  except Hive, Pig        interactive mode
                  Hive, Pig
Compatibility     to data types and data   to data types and data  to data types and data
                  sources is the same      sources is the same     sources is the same
YARN integration  YARN application         Ground-up YARN          Spark is moving
                                           application             towards YARN
29
Hadoop MapReduce vs Tez vs Spark
Criteria     MapReduce            Tez                  Spark
Deployment   YARN                 YARN                 [Standalone, YARN, SIMR, Mesos, …]
Performance  -                    -                    - Good performance when data fits
                                                         into memory
                                                       - Performance degradation otherwise
Security     More features and    More features and    Still in its infancy
             projects             projects
30
Partial support
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions, and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
32
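Point 1 above (reuse your existing mapper and reducer functions) can be sketched in plain Python: the same `mapper`/`reducer` pair runs unchanged under an MR-style driver and under a Spark-style chained driver; only the surrounding plumbing changes. Function names and data here are illustrative, not from any real migration guide:

```python
from itertools import groupby

# Existing MapReduce logic, reused unchanged by both drivers below.
def mapper(line):
    return [(w, 1) for w in line.split()]

def reducer(key, values):
    return (key, sum(values))

lines = ["a b a", "b b"]

# MR-style driver: map, sort/shuffle by key, reduce per group.
pairs = sorted(kv for line in lines for kv in mapper(line))
mr_out = [reducer(k, [v for _, v in grp])
          for k, grp in groupby(pairs, key=lambda kv: kv[0])]

# Spark-style driver: same functions inside a chained pipeline
# (flatMap -> groupByKey -> map, built here from plain Python pieces).
grouped = {}
for k, v in (kv for line in lines for kv in mapper(line)):
    grouped.setdefault(k, []).append(v)
spark_style_out = sorted(reducer(k, vs) for k, vs in grouped.items())

assert mr_out == spark_style_out
print(mr_out)
```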
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigDatacom: httpsparkbigdatacomcomponenttagstag19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open), Q1 2015: httpsissuesapacheorgjirabrowseHIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigDatacom: httpsparkbigdatacomcomponenttagstag12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: httpsissuesapacheorgjirabrowseSQOOP-1532
37
Cascading (expected in the 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014 - "Goodbye MapReduce": Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
Services and corresponding open source tools (tool logos appear in the original slide):
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore, e.g. the Spark-HBase Connector: httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
45
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra. http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com. http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches. http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark in Cassandra. http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1). http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo. https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights. http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015. http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015. http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014. http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN (Yet Another Resource Negotiator) is an implicit reference to Mesos as the resource negotiator.
• Integration is still improving, and some open issues are critical ones. https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN. http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN. https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883). https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill. http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide. http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game. http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com. http://sparkbigdata.com/component/tags/tag/24-kafka
54
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide. https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015. http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
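The idea behind that schema inference step can be pictured with a stdlib-only toy sketch (this is an illustration of the concept, not Spark SQL's implementation; the merge-to-string rule is a simplification of its real type widening):

```python
import json

def infer_type(value):
    """Map a JSON value to a simple type name, a toy analogue of Spark SQL's inference."""
    if isinstance(value, bool):       # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    return "unknown"

def infer_schema(json_lines):
    """Scan newline-delimited JSON records and merge field types into one schema."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            t = infer_type(value)
            # crude type union: widen to 'string' on conflicting types
            if schema.get(field, t) != t:
                t = "string"
            schema[field] = t
    return schema

lines = ['{"name": "Alice", "age": 34}', '{"name": "Bob", "age": 29, "vip": true}']
print(infer_schema(lines))  # {'name': 'string', 'age': 'integer', 'vip': 'boolean'}
```

One pass over the data yields a queryable schema, which is exactly why no DDL is needed.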
56
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
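The appeal of a columnar format like Parquet — a query reads only the columns it needs — can be shown with a stdlib-only toy sketch (this illustrates the layout idea only, not Parquet's actual encoding; the records are made up):

```python
# Row-oriented records, as they might arrive from an ingestion pipeline.
rows = [
    {"user": "alice", "country": "US", "visits": 10},
    {"user": "bob",   "country": "FR", "visits": 3},
    {"user": "carol", "country": "US", "visits": 7},
]

def to_columnar(rows):
    """Pivot row-oriented records into a dict of column name -> list of values."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

table = to_columnar(rows)

# A query touching one column scans only that column's values,
# which is the core I/O saving of columnar storage.
total_visits = sum(table["visits"])
print(total_visits)  # 20
```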
57
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL (requires Spark 1.2+). https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL. http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case. http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
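The "dynamically split the data" step in that use case can be pictured with a small stdlib-only sketch that routes mixed inbound records into per-type buckets (the field names and record shapes here are invented for illustration):

```python
from collections import defaultdict

def split_by_type(records):
    """Group heterogeneous inbound records by their declared type field,
    tolerating layouts that change without notice."""
    buckets = defaultdict(list)
    for record in records:
        kind = record.get("type", "unknown")  # unknown layouts land in a catch-all bucket
        buckets[kind].append(record)
    return dict(buckets)

inbound = [
    {"type": "click", "url": "/home"},
    {"type": "purchase", "sku": "A-1", "amount": 9.99},
    {"type": "click", "url": "/cart"},
    {"sku": "B-2"},  # a record with no type field at all
]
buckets = split_by_type(inbound)
print(sorted(buckets))        # ['click', 'purchase', 'unknown']
print(len(buckets["click"]))  # 2
```

Each bucket could then be written out with its own Avro schema, giving the compact binary storage the slide mentions.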
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark Demo. https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark. https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch. http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark. http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter. http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop. https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
[Diagram: the Hadoop ecosystem side by side with the Spark ecosystem]
4. Complementarity: Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014. http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015. http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com. http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity: YARN + Mesos
References:
• Apache Mesos vs. Apache Hadoop YARN. https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster. https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management. http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL. https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration. http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN. https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer. http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014. http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group. http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015. http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015. http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015. http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon. http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012. https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015. http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support). http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS. https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local. http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone. http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos. http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2. http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR. http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace. http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform. http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI). http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH). http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015. https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014. https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014. http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014. http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com. http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com. http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com. http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr. http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com. http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives
Hadoop ecosystem → Spark ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com. http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014. http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com. http://sparkbigdata.com/component/tags/tag/11-core-spark
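The flavor of that native functional API — chained transformations built from lambdas — can be imitated with a tiny stdlib-only class (a toy, in-process stand-in, not the real distributed RDD implementation):

```python
from functools import reduce as _reduce

class LocalRDD:
    """A toy, in-process stand-in for Spark's RDD chaining style."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return LocalRDD(f(x) for x in self.data)

    def filter(self, pred):
        return LocalRDD(x for x in self.data if pred(x))

    def reduce(self, f):
        return _reduce(f, self.data)

    def collect(self):
        return self.data

# Chained transformations with lambdas, in the spirit of the Scala/Java 8/Python APIs.
squares = LocalRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(squares.collect())                   # [0, 4, 16]
print(squares.reduce(lambda a, b: a + b))  # 20
```

The same pipeline reads almost identically in Scala or with Java 8 lambdas, which is the conciseness the slide refers to.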
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
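That "mix and match SQL with imperative code" idea can be imitated with a stdlib-only sketch using sqlite3 in place of Spark SQL (a toy analogy, not the Spark API; the table and the uplift step are invented for illustration):

```python
import sqlite3

# Declarative step followed by an imperative step, mirroring how Spark SQL
# lets you interleave SQL queries with RDD/DataFrame code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)])

# Declarative: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user").fetchall()

# Imperative: arbitrary post-processing over the query result.
report = {user: total * 2 for user, total in rows}  # e.g. double each total
print(report)  # {'alice': 25.0, 'bob': 10.0}
```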
92
Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com. http://sparkbigdata.com/component/tags/tag/5-mllib
93
Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com. http://sparkbigdata.com/component/tags/tag/3-spark-streaming
94
Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
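The "mini batches" processing model in the table can be pictured with a stdlib-only sketch that groups a stream of records into batches and runs a per-batch computation, roughly the shape of Spark Streaming's DStream model (Spark batches by time interval; batch size is used here for simplicity, and the data is invented):

```python
def mini_batches(records, batch_size):
    """Group a stream of records into fixed-size mini-batches."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

def word_counts(batch):
    """Per-batch computation, analogous to a DStream transformation."""
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["spark", "storm", "spark", "flink", "spark", "storm", "kafka"]
for batch in mini_batches(stream, 3):
    print(word_counts(batch))
```

Processing whole batches with the same code used for batch jobs is what gives Spark Streaming its "Core Spark API" integration, at the cost of the few seconds of latency shown above.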
95
GraphX
• 'GraphX' tag at SparkBigData.com. http://sparkbigdata.com/component/tags/tag/6-graphx
96
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
Agenda
I Motivation
II Big Data Typical Big Data
Stack Apache Hadoop
Apache Spark
III Spark with Hadoop
IV Spark without Hadoop
V More QampA
3
I Motivation
1 News
2 Surveys
3 Vendors
4 Analysts
5 Key Takeaways
4
1 Newsbull Is it Spark vs OR and Hadoop
bull Apache Spark Hadoop friend or foe
bull Apache Spark killer or savior of Apache Hadoop
bull Apache Sparks Marriage To Hadoop Will Be Bigger Than Kim And Kanye
bull Adios Hadoop Hola Spark
bull Apache Spark Moving on from Hadoop
bull Apache Spark Continues to Spread Beyond Hadoop
bull Escape From Hadoop
bull Spark promises to end up Hadoop but in a good way
5
2 Surveysbull Hadoops historic focus on batch processing of data
was well supported by MapReduce but there is an
appetite for more flexible developer tools to support
the larger market of mid-size datasets and use cases
that call for real-time processingrdquo 2015 Apache Spark
Survey by Typesafe January 27 2015
httpwwwmarketwiredcompress-releasesurvey-indicates-apache-spark-
gaining-developer-adoption-as-big-datas-projects-1986162htm
bull Apache Spark Preparing for the Next Wave of
Reactive Big Data January 27 2015 by Typesafe
httptypesafecomblogapache-spark-preparing-for-the-next-wave-of-reactive-
big-data
6
Apache Spark Survey 2015 by
Typesafe - Quick Snapshot
7
3 Vendors
8
bull Spark and Hadoop Working Together January 21
2014 by Ion Stoica httpsdatabrickscomblog20140121spark-and-
hadoophtml
bull Uniform API for diverse workloads over diverse
storage systems and runtimes
Source Slide 16 of lsquoSparks Role in the Big Data Ecosystem (Spark
Summit 2014) November 2014 Matei
Zahariahttpwwwslidesharenetdatabricksspark-summit2014
bull The goal of Apache Spark is to have one engine for all
data sources workloads and environmentsrdquo
Source Slide 15 of lsquoNew Directions for Apache Spark in 2015
February 20 2015 Strata + Hadoop Summit Matei Zaharia
httpwwwslidesharenetdatabricksnew-directions-for-apache-spark-in-2015
3 Vendorsbull ldquoSpark is already an excellent piece of software and is
advancing very quickly No vendor mdash no new project mdashis likely to catch up Chasing Spark would be a wasteof time and would delay availability of real-time analyticand processing services for no good reason rdquoSource MapReduce and Spark December 302013 httpvisionclouderacommapreduce-spark
bull ldquoApache Spark is an open source parallel dataprocessing framework that complements ApacheHadoop to make it easy to develop fast unified Big Dataapplications combining batch streaming and interactiveanalytics on all your datardquohttpwwwclouderacomcontentclouderaenproducts-and-servicescdhsparkhtml
9
3 Vendorsbull ldquoApache Spark is a general-purpose engine for large-
scale data processing Spark supports rapid application
development for big data and allows for code reuse
across batch interactive and streaming applications
Spark also provides advanced execution graphs with in-
memory pipelining to speed up end-to-end application
performancerdquo httpswwwmaprcomproductsapache-spark
bull MapR Adds Complete Apache Spark Stack to its
Distribution for Hadoophttpswwwmaprcomcompanypress-releasesmapr-adds-complete-apache-
spark-stack-its-distribution-hadoop
10
3 Vendorsbull ldquoApache Spark provides an elegant attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast in-memory data processingrdquo httphortonworkscomhadoopspark
bull Hortonworks A shared vision for Apache Spark on Hadoop October 212014httpsdatabrickscomblog20141031hortonworks-a-shared-vision-for-apache-spark-on-hadoophtml
bull ldquoAt Hortonworks we love Spark and want to help our customers leverage all its benefitsrdquo October 30th 2014httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
11
4 Analystsbull Is Apache Spark replacing Hadoop or complementing
existing Hadoop practice
bull Both are already happening
bull With uncertainty about ldquowhat is Hadooprdquo there is no
reason to think solution stacks built on Spark not
positioned as Hadoop will not continue to proliferate
as the technology matures
bull At the same time Hadoop distributions are all
embracing Spark and including it in their offerings
Source Hadoop Questions from Recent Webinar Span Spectrum
February 25 2015httpblogsgartnercommerv-adrian20150225hadoop-
questions-from-recent-webinar-span-spectrum
12
4 Analysts bull ldquoAfter hearing the confusion between Spark and
Hadoop one too many times I was inspired to write a report The Hadoop Ecosystem Overview Q4 2104
bull For those that have day jobs that donrsquot include constantly tracking Hadoop evolution I dove in and worked with Hadoop vendors and trusted consultants to create a framework
bull We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System and extended group of components that leverage but do not require itrdquo Source Elephants Pigs Rhinos and Giraphs Oh My ndash Its Time To Get A Handle On Hadoop Posted by Brian Hopkins on November 26 2014
httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
13
5 Key Takeaways
1 News Big Data is no longer a Hadoop
monopoly
2 Surveys Listen to what Spark developers are
saying
3 Vendors ltHadoop Vendorgt-tinted goggles
FUD is still being lsquoofferedrsquo by some Hadoop
vendors Claims need to be contextualized
4 Analysts Thorough understanding of the
market dynamics
14
II Big Data Typical Big Data
Stack Hadoop Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1 Big Databull Big Data is still one of the most inflated buzzword of
the last years
bull Big Data is a broad term for data sets so large or
complex that traditional data processing tools are
inadequate httpenwikipediaorgwikiBig_data
bull Hadoop is becoming a traditional tool Above
definition is inadequate
bull ldquoBig Data refers to datasets and flows large enough
that has outpaced our capability to store process
analyze and understandrdquo Amir H Payberah
Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3 Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• 'Hadoop's Impact on Data Management's Future' - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop.
https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4 Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial ones. I'm compiling a list. Stay tuned!
19
5 Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data Stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data! http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi: a Scala productivity framework for Hadoop https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding: a Scala API for Cascading http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
23
• 1st Generation (MapReduce): Batch
• 2nd Generation (Tez): Batch, Interactive
• 3rd Generation (Spark): Batch, Interactive, Near-Real-Time
• 4th Generation (Flink): Batch, Interactive, Real-Time, Iterative
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch! Scalability! Abstractions (see the slide on the evolution of Programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning, and Graph Analytics.
24
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."
Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1 Evolution
• 'Spark' for lightning-fast speed!
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
1 Evolution Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink http://flink.apache.org offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark

Criteria: Hadoop MapReduce | Tez | Spark
• License: Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
• Processing Model: On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive | In-Memory and On-Disk, Batch, Interactive, Streaming (Near Real-Time)
• Language written in: Java | Java | Scala
• API: [Java, Python, Scala], User-Facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], User-Facing
• Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria: Hadoop MapReduce | Tez | Spark
• Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
• Ease of Use: Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
• Compatibility: to data types and data sources is the same | to data types and data sources is the same | to data types and data sources is the same
• YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria: Hadoop MapReduce | Tez | Spark
• Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
• Performance: - | - | Good performance when data fits into memory; performance degradation otherwise
• Security: More features and projects | More features and projects | Still in its infancy (partial support)
30
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: 'How-to: Translate from MapReduce to Apache Spark' http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
32
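To make the translation concrete, here is a minimal sketch of the classic word count, MapReduce's canonical example, rewritten against Spark's Scala API. It assumes an existing SparkContext `sc` (as provided by spark-shell); the input and output paths are placeholders:

```scala
// Word count on Spark: the MR map phase becomes flatMap/map,
// the MR reduce phase becomes reduceByKey.
// Assumes `sc` is a live SparkContext; paths are hypothetical.
val counts = sc.textFile("hdfs:///data/input")
  .flatMap(line => line.split("\\s+"))  // emit words (map side)
  .map(word => (word, 1))               // key each word with a count of 1
  .reduceByKey(_ + _)                   // sum counts per word (reduce side)
counts.saveAsTextFile("hdfs:///data/wordcounts")
```

The whole job is a handful of lines instead of separate Mapper, Reducer, and Driver classes.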
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in Beta,
expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in Beta,
expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• 'Hive on Spark', February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• 'Hive on Spark is blazing fast, or is it?', Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark
(expected in Sqoop 2)
• Sqoop (aka 'from SQL to Hadoop') was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: 'Support Sqoop on Spark Execution Engine' (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in the 3.1 release)
• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release.
Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• 'Mahout scala and spark bindings', Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• 'Co-occurrence Based Recommendations with Mahout, Scala and Spark', published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
(Diagram: open source tools by service layer - Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL)
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM! Discardable Distributed Memory http://hortonworks.com/blog/ddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
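The newAPIHadoopRDD route mentioned above can be sketched as follows (a hedged sketch mirroring Spark's bundled HBaseTest.scala; it assumes a live SparkContext `sc`, HBase configuration on the classpath, and a placeholder table name):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Read an HBase table as an RDD of (row key, row result) pairs
// through the standard Hadoop InputFormat machinery.
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // placeholder table name
val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println(s"Rows in table: ${hbaseRDD.count()}")
```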
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• 'Kindling: An Introduction to Spark with Cassandra (Part 1)': http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
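Typical use of the Spark Cassandra Connector looks like the following hedged sketch (keyspace, table, and column names are placeholders; it assumes `spark.cassandra.connection.host` is set on the SparkConf and the connector jar is on the classpath):

```scala
import com.datastax.spark.connector._ // adds cassandraTable/saveToCassandra

// Read a Cassandra table as an RDD of CassandraRow.
// Assumes keyspace "test" with table words(word text, count int) exists.
val words = sc.cassandraTable("test", "words")
words.collect().foreach(row => println(row.getString("word")))

// Write an ordinary RDD of tuples back to the same table.
sc.parallelize(Seq(("spark", 10), ("cassandra", 5)))
  .saveToCassandra("test", "words", SomeColumns("word", "count"))
```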
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• 'MongoDB and Hadoop: Driving Business Insights': http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• 'Using MongoDB with Hadoop & Spark':
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• 'Getting Started with Apache Spark and Neo4j Using Docker Compose', by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• 'Categorical PageRank Using Neo4j and Apache Spark', by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• 'Using Apache Spark and Neo4j for Big Data Graph Analytics', by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator).
• Integration is still improving (JIRA query: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC): https://issues.apache.org/jira/issues/
• Some of the issues are critical ones!
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• 'Get the most out of Spark on YARN': https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
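The Hive support above can be sketched in a few lines (a hedged sketch: it assumes Spark built with Hive support, a hive-site.xml on the classpath, a live SparkContext `sc`, and the sample table "src" from the Hive documentation):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext picks up the Hive metastore configuration and lets
// Spark SQL query existing Hive tables directly.
val hiveCtx = new HiveContext(sc)
val rows = hiveCtx.sql("SELECT key, value FROM src WHERE key < 10")
rows.collect().foreach(println)
```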
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: 'What's Coming in 2015 for Drill': http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• 'Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game': http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
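The native integration can be sketched with the receiver-based API current as of Spark 1.2 (a hedged sketch: ZooKeeper address, consumer group, and topic name are placeholders, and a live SparkContext `sc` plus the spark-streaming-kafka artifact are assumed):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Consume a Kafka topic as a DStream of (key, message) pairs
// in 10-second micro-batches.
val ssc = new StreamingContext(sc, Seconds(10))
val messages = KafkaUtils.createStream(
  ssc, "zkhost:2181", "demo-group", Map("events" -> 1))
messages.map(_._2) // drop the key, keep the message payload
  .count()
  .print()         // print per-batch message counts
ssc.start()
ssc.awaitTermination()
```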
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• 'An introduction to JSON support in Spark SQL', February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
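Schema inference on JSON can be sketched as follows (a hedged sketch using the pre-1.3 `jsonFile` API; the path is a placeholder and a live SparkContext `sc` is assumed):

```scala
import org.apache.spark.sql.SQLContext

// Point Spark SQL at JSON and query it - no DDL needed.
val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()             // schema was inferred automatically
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21")
  .collect().foreach(println)
```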
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
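A round trip through Parquet can be sketched like this (a hedged sketch against the Spark 1.2-era SchemaRDD API; the case class, data, and paths are placeholders, and a live SparkContext `sc` is assumed):

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit: RDD[Person] -> SchemaRDD

// Write a small dataset out as columnar Parquet, then query it back.
val people = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 17)))
people.saveAsParquetFile("hdfs:///data/people.parquet")
val parquetPeople = sqlContext.parquetFile("hdfs:///data/people.parquet")
parquetPeople.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 18")
  .collect().foreach(println)
```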
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: 'CrunchIndexerTool on Spark'.
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• 'Ingesting HDFS data into Solr using Spark': http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• 'Big Data Web applications for Interactive Hadoop': https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• 'The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark', October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• 'Spark and in-memory databases: Tachyon leading the pack', January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References
References:
• 'Apache Mesos vs Apache Hadoop YARN': https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• 'Myriad Project Marries YARN and Apache Mesos Resource Management': http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• 'YARN vs MESOS: Can't We All Just Get Along?': http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity +
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• 'Improving Spark for Data Pipelines with Native YARN Integration': http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• 'Get the most out of Spark on YARN': https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• 'Matt Schumpert on Datameer Smart Execution Engine': http://www.infoq.com/articles/datameer-smart-execution-engine Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• 'The Challenge to Choosing the "Right" Execution Engine', by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• 'Operating in a Multi-execution Engine Hadoop Environment', by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• 'New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption', February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• 'Syncsort Automates Data Migrations Across Multiple Platforms', February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• 'Framework for the Future of Hadoop', March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark; more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. 'Because Hadoop isn't perfect: 8 ways to replace HDFS', July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• 'Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage', March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
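From the application's point of view, the deployment choice above mostly reduces to the master URL. A hedged sketch (host names are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The same application code runs on any cluster manager; only the
// master URL changes (hosts below are placeholders):
//   "local[*]"                     - local mode, one worker thread per core
//   "spark://master:7077"          - standalone cluster
//   "mesos://master:5050"          - Apache Mesos
//   "yarn-client" / "yarn-cluster" - Hadoop YARN (Spark 1.x style)
val conf = new SparkConf().setAppName("DeploymentDemo").setMaster("local[*]")
val sc = new SparkContext(conf)
println(sc.parallelize(1 to 100).sum()) // tiny smoke test
sc.stop()
```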
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• 'Databricks Cloud: From raw data to insights and data products in an instant', March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• 'Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra', Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• 'Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector', Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
• 'Guavus embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos', September 25, 2014, by Eric Carr. httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus (httpwwwguavuscom) operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
86
4. Alternatives

             Hadoop Ecosystem   Spark Ecosystem
Component    HDFS               Tachyon
             YARN               Mesos
Tools        Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. httptachyon-projectorg
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). httpsamplabcsberkeleyedusoftware
88
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a data center between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs Mesos

Criteria           YARN                   Mesos
Resource sharing   Yes                    Yes
Written in         Java                   C++
Scheduling         Memory only            CPU and memory
Running tasks      Unix processes         Linux container groups
Requests           Specific requests      More generic, but more
                   and locality           coding for writing
                   preference             frameworks
Maturity           Less mature            Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code much more concise, nearly as simple as the Scala API
• 'ETL with Spark' - First Spark London Meetup, May 28, 2014. httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
91
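The conciseness claim above is easy to see in miniature. The word count below is plain Python, not PySpark, so it runs without a Spark installation; it mirrors the flatMap / reduceByKey style of the Spark API. The commented PySpark equivalent is a sketch and assumes a SparkContext named `sc`.

```python
from collections import Counter

lines = ["spark and hadoop", "spark without hadoop"]

# flatMap-style step: split each line into a flat list of words
words = [w for line in lines for w in line.split()]

# reduceByKey-style step: count occurrences of each word
counts = Counter(words)

print(counts["spark"])   # prints 2
print(counts["hadoop"])  # prints 2

# Sketch of the PySpark equivalent (assumes a SparkContext `sc`):
#   sc.parallelize(lines).flatMap(str.split) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
```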
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
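The automatic schema inference that Spark SQL applies to JSON datasets can be pictured with a toy sketch: scan each record, union the fields seen, and note a type per field. The `infer_schema` helper below is an illustration of the idea only, not Spark SQL's actual algorithm (which also resolves type conflicts and nested structures).

```python
import json

records = [
    '{"name": "spark", "stars": 5000}',
    '{"name": "hadoop", "stars": 9000, "tags": ["batch"]}',
]

def infer_schema(json_lines):
    """Toy schema inference: union all fields and record each field's type name."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            # keep the first type seen for each field
            schema.setdefault(field, type(value).__name__)
    return schema

print(infer_schema(records))
# {'name': 'str', 'stars': 'int', 'tags': 'list'}
```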
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag3-spark-streaming
Storm vs Spark Streaming

Criteria             Storm                  Spark Streaming
Processing model     Record at a time       Mini-batches
Latency              Sub-second             Few seconds
Fault tolerance      At least once          Exactly once
(every record        (may be duplicates)
processed)
Batch framework      Not available          Core Spark API
integration
Supported            Any programming        Scala, Java,
languages            language               Python
95
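The "mini-batch" model in the table can be illustrated in a few lines of plain Python: instead of handling each record as it arrives (Storm's model), the stream is chopped into small batches that are each processed as a whole, which is roughly what Spark Streaming's DStream abstraction does with a time-based batch interval. This sketch uses a count-based batch size purely for illustration.

```python
def mini_batches(stream, batch_size):
    """Group an (in principle unbounded) stream of records into small batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # hand a whole batch to the processing step
            batch = []
    if batch:                    # flush the final partial batch
        yield batch

events = ["click", "view", "click", "view", "click"]
for batch in mini_batches(events, batch_size=2):
    print(batch)
# ['click', 'view']
# ['click', 'view']
# ['click']
```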
GraphX
96
'GraphX' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
Notebook
97
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner. httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython. httpsgithubcomtribbloidISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
httpwwwSparkBigDatacom
sbaltagi@gmail.com
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
Apache Spark Survey 2015 by
Typesafe - Quick Snapshot
7
3 Vendors
8
bull Spark and Hadoop Working Together January 21
2014 by Ion Stoica httpsdatabrickscomblog20140121spark-and-
hadoophtml
bull Uniform API for diverse workloads over diverse
storage systems and runtimes
Source Slide 16 of lsquoSparks Role in the Big Data Ecosystem (Spark
Summit 2014) November 2014 Matei
Zahariahttpwwwslidesharenetdatabricksspark-summit2014
bull The goal of Apache Spark is to have one engine for all
data sources workloads and environmentsrdquo
Source Slide 15 of lsquoNew Directions for Apache Spark in 2015
February 20 2015 Strata + Hadoop Summit Matei Zaharia
httpwwwslidesharenetdatabricksnew-directions-for-apache-spark-in-2015
3 Vendorsbull ldquoSpark is already an excellent piece of software and is
advancing very quickly No vendor mdash no new project mdashis likely to catch up Chasing Spark would be a wasteof time and would delay availability of real-time analyticand processing services for no good reason rdquoSource MapReduce and Spark December 302013 httpvisionclouderacommapreduce-spark
bull ldquoApache Spark is an open source parallel dataprocessing framework that complements ApacheHadoop to make it easy to develop fast unified Big Dataapplications combining batch streaming and interactiveanalytics on all your datardquohttpwwwclouderacomcontentclouderaenproducts-and-servicescdhsparkhtml
9
3 Vendorsbull ldquoApache Spark is a general-purpose engine for large-
scale data processing Spark supports rapid application
development for big data and allows for code reuse
across batch interactive and streaming applications
Spark also provides advanced execution graphs with in-
memory pipelining to speed up end-to-end application
performancerdquo httpswwwmaprcomproductsapache-spark
bull MapR Adds Complete Apache Spark Stack to its
Distribution for Hadoophttpswwwmaprcomcompanypress-releasesmapr-adds-complete-apache-
spark-stack-its-distribution-hadoop
10
3 Vendorsbull ldquoApache Spark provides an elegant attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast in-memory data processingrdquo httphortonworkscomhadoopspark
bull Hortonworks A shared vision for Apache Spark on Hadoop October 212014httpsdatabrickscomblog20141031hortonworks-a-shared-vision-for-apache-spark-on-hadoophtml
bull ldquoAt Hortonworks we love Spark and want to help our customers leverage all its benefitsrdquo October 30th 2014httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
11
4 Analystsbull Is Apache Spark replacing Hadoop or complementing
existing Hadoop practice
bull Both are already happening
bull With uncertainty about ldquowhat is Hadooprdquo there is no
reason to think solution stacks built on Spark not
positioned as Hadoop will not continue to proliferate
as the technology matures
bull At the same time Hadoop distributions are all
embracing Spark and including it in their offerings
Source Hadoop Questions from Recent Webinar Span Spectrum
February 25 2015httpblogsgartnercommerv-adrian20150225hadoop-
questions-from-recent-webinar-span-spectrum
12
4 Analysts bull ldquoAfter hearing the confusion between Spark and
Hadoop one too many times I was inspired to write a report The Hadoop Ecosystem Overview Q4 2104
bull For those that have day jobs that donrsquot include constantly tracking Hadoop evolution I dove in and worked with Hadoop vendors and trusted consultants to create a framework
bull We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System and extended group of components that leverage but do not require itrdquo Source Elephants Pigs Rhinos and Giraphs Oh My ndash Its Time To Get A Handle On Hadoop Posted by Brian Hopkins on November 26 2014
httpblogsforrestercombrian_hopkins14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
13
5 Key Takeaways
1 News Big Data is no longer a Hadoop
monopoly
2 Surveys Listen to what Spark developers are
saying
3 Vendors ltHadoop Vendorgt-tinted goggles
FUD is still being lsquoofferedrsquo by some Hadoop
vendors Claims need to be contextualized
4 Analysts Thorough understanding of the
market dynamics
14
II Big Data Typical Big Data
Stack Hadoop Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1 Big Databull Big Data is still one of the most inflated buzzword of
the last years
bull Big Data is a broad term for data sets so large or
complex that traditional data processing tools are
inadequate httpenwikipediaorgwikiBig_data
bull Hadoop is becoming a traditional tool Above
definition is inadequate
bull ldquoBig Data refers to datasets and flows large enough
that has outpaced our capability to store process
analyze and understandrdquo Amir H Payberah
Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3 Apache Hadoopbull Apache Hadoop as an example of a Typical Big Data
Stack
bull Hadoop ecosystem = Hadoop Stack + many other tools
(either open source and free or commercial ones)
bull Big Data Ecosystem Dataset httpbigdataandreamostosiname
Incomplete but a useful list of Big Data related projects
packed into a JSON dataset
bull Hadoops Impact on Data Managements Future - Amr
Awadallah (Strata + Hadoop 2015) February 19 2015 Watch
video at 236 on lsquoHadoop Isnrsquot Just Hadoop Anymorersquo for a picture
representing the evolution of Apache Hadoop
httpswwwyoutubecomwatchv=1KvTZZAkHy0
18
4 Apache Sparkbull Apache Spark as an example of a Typical Big Data Stack
bull Apache Spark provides you Big Data computing and more
bull BYOS Bring Your Own Storage
bull BYOC Bring Your Own Cluster
bull Spark Core httpsparkbigdatacomcomponenttagstag11-core-spark
bull Spark Streaming httpsparkbigdatacomcomponenttagstag3-spark-streaming
bull Spark SQL httpsparkbigdatacomcomponenttagstag4-spark-sql
bull MLlib (Machine Learning) httpsparkbigdatacomcomponenttagstag5-mllib
bull GraphX httpsparkbigdatacomcomponenttagstag6-graphx
bull Spark ecosystem is emerging fast with roots from BDAS Berkley Data Analytics Stack and new tools from both the open source community and commercial one Irsquom compiling a list Stay tuned
19
5 Key Takeaways
1 Big Data Still one of the most inflated
buzzword
2 Typical Big Data Stack Big Data Stacks look
similar on paper Arenrsquot they
3 Apache Hadoop Hadoop is no longer
lsquosynonymousrsquo of Big Data
4 Apache Spark Emergence of the Apache
Spark ecosystem
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
bull MapReduce in Java is like assembly code of Big
Data httpwikiapacheorghadoopWordCount
bull Pig httppigapacheorg
bull Hive httphiveapacheorg
bull Scoobi A Scala productivity framework for HadoophttpsgithubcomNICTAscoobi
bull Cascading httpwwwcascadingorg
bull Scalding A Scala API for Cascading httptwittercomscalding
bull Crunch httpcrunchapacheorg
bull Scrunch httpcrunchapacheorgscrunchhtml
22
1 Evolution of Compute ModelsWhen the Apache Hadoop project started in 2007
MapReduce v1 was the only choice as a compute model
(Execution Engine) on Hadoop Now we have in addition
to MapReduce v2 Tez Spark and Flink
23
bull Batch bull Batch
bull Interactive
bull Batch
bull Interactive
bull Near-Real
time
bull Batch
bull Interactive
bull Real-Time
bull Iterative
bull 1st
Generation
bull 2nd
Generation
bull 3rd
Generation
bull 4th
Generation
1 Evolution
bull This is how Hadoop MapReduce is branding itself ldquoA YARN-based system for parallel processing of large data sets httphadoopapacheorg
bull Batch Scalability Abstractions ( See slide on evolution of Programming APIs) User Defined Functions (UDFs)hellip
bull Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job In practice most problems dont fit neatly into a single MR job
bull Need to integrate many disparate tools for advanced Big Data Analytics for Queries Streaming Analytics Machine Learning and Graph Analytics
24
1 Evolution
bull Tez Hindi for ldquospeedrdquo
bull This is how Apache Tez is branding itself ldquoTheApache Tez project is aimed at building anapplication framework which allows for a complexdirected-acyclic-graph of tasks for processingdata It is currently built atop YARNrdquo
Source httptezapacheorg
bull Apachetrade Tez is an extensible framework for
building high performance batch and
interactive data processing applicationscoordinated by YARN in Apache Hadoop
25
1 Evolution
bull lsquoSparkrsquo for lightning fast speed
bull This is how Apache Spark is branding itselfldquoApache Sparktrade is a fast and general engine forlarge-scale data processingrdquo httpssparkapacheorg
bull Apache Spark is a general purpose clustercomputing framework its execution modelsupports wide variety of use cases batchinteractive near-real time
bull The rapid in-memory processing of resilientdistributed datasets (RDDs) is the ldquocorecapabilityrdquo of Apache Spark
26
1 Evolution Apache Flink
bull Flink German for ldquonimble swift speedyrdquo
bull This is how Apache Flink is branding itself ldquoFast andreliable large-scale data processing enginerdquo
bull Apache Flink httpflinkapacheorg offers
bull Batch and Streaming in the same system
bull Beyond DAGs (Cyclic operator graphs)
bull Powerful expressive APIs
bull Inside-the-system iterations
bull Full Hadoop compatibility
bull Automatic language independent optimizer
bull lsquoFlinkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag27-flink
27
Hadoop MapReduce vs Tez vs Spark
Criteria
License Open Source
Apache 20 version
2x
Open Source
Apache 20
version 0x
Open Source
Apache 20 version
1x
Processing
Model
On-Disk (Disk-
based
parallelization)
Batch
On-Disk Batch
Interactive
In-Memory On-Disk
Batch Interactive
Streaming (Near Real-
Time)
Language written
in
Java Java Scala
API [Java Python
Scala] User-Facing
Java[
ISVEngineTool
builder]
[Scala Java Python]
User-Facing
Libraries None separate tools None [Spark Core Spark
Streaming Spark SQL
MLlib GraphX]
28
Hadoop MapReduce vs Tez vs Spark
Criteria
Installation Bound to Hadoop Bound to Hadoop Isnrsquot bound to
Hadoop
Ease of Use Difficult to program
needs abstractions
No Interactive mode
except Hive Pig
Difficult to program
No Interactive
mode except Hive
Pig
Easy to program
no need of
abstractions
Interactive mode
Compatibilit
y
to data types and data
sources is same
to data types and
data sources is
same
to data types and
data sources is
same
YARN
integration
YARN application Ground up YARN
application
Spark is moving
towards YARN
29
Hadoop MapReduce vs Tez vs Spark
Criteria
Deployment YARN YARN [Standalone YARN
SIMR Mesos hellip]
Performance - Good performance
when data fits into
memory
- performance
degradation otherwise
Security More features and
projects
More
features and
projects
Still in its infancy
30
Partial support
IV Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
bull Existing Hadoop MapReduce projects can
migrate to Spark and leverage Spark Core as
execution engine
1 You can often reuse your mapper and
reducer functions and just call them in
Spark from Java or Scala
2 You can translate your code from
MapReduce to Apache Spark How-to
Translate from MapReduce to Apache Sparkhttpblogclouderacomblog201409how-to-translate-from-mapreduce-to-
apache-spark
32
2 Transition
3 The following tools originally based on Hadoop
MapReduce are being ported to Apache Spark
bull Pig Hive Sqoop Cascading Crunch Mahout hellip
33
Pig on Spark (Spork)
bull Run Pig with ldquondashx sparkrdquo option for an easy migration
without development effort
bull Speed up your existing pig scripts on Spark ( Query
Logical Plan Physical Pan)
bull Leverage new Spark specific operators in Pig such as
Cache
bull Still leverage many existing Pig UDF libraries
bull Pig on Spark Umbrella Jira (Status Passed end-to-end test
cases on Pig still Open) httpsissuesapacheorgjirabrowsePIG-4059
bull Fix outstanding issues and address additional Spark functionality
through the community
bull lsquoPig on Sparkrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag19
34
Hive on Spark (Currently in Beta
Expected in Hive 110)
bull New alternative to using MapReduce or Tez
hivegt set hiveexecutionengine=spark
bull Help existing Hive applications running on
MapReduce or Tez easily migrate to Spark without
development effort
bull Exposes Spark users to a viable feature-rich de facto
standard SQL tool on Hadoop
bull Performance benefits especially for Hive queries
involving multiple reducer stages
bull Hive on Spark Umbrella Jira (Status Open) Q1 2015httpsissuesapacheorgjirabrowseHIVE-7292
35
Hive on Spark (Currently in Beta
Expected in Hive 110)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-
motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Start
ed
bull Hive on Spark February 11 2015 Szehon Ho
Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and
Mostapah Mokhtar (Hortonworks) February 20 2015httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
36
Sqoop on Spark
(Expected in Sqoop 2)
bull Sqoop ( aka from SQL to Hadoop) was initially
developed as a tool to transfer data from RDBMS to
Hadoop
bull The next version of Sqoop referred to as Sqoop2
supports data transfer across any two data sources
bull Sqoop 2 Proposal is still under
discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Pro
posal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira
Status Work In Progress) The goal of this ticket is to support a
pluggable way to select the execution engine on which we can run
the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
37
(Expected in 31 release)
bull Cascading httpwwwcascadingorg is an application
development platform for building data applications on
Hadoop
bull Support for Apache Spark is on the roadmap and will be
available in Cascading 31 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the
transition from CascadingScalding to Spark a little
easier by adding support for Cascading Taps Scalding
Sources and the Scalding Fields API in Spark Sourcehttpscaldingio201410running-scalding-on-apache-spark
38
Apache Crunch
bull The Apache Crunch Java library provides a
framework for writing testing and running
MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 011 releases with a
SparkPipeline class making it easy to migrate
data processing applications from MapReduce
to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSpark
Pipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-
xtopicscdh_ig_running_crunch_with_sparkhtml
39
(Expec (Expected in Mahout 10 )
bull Mahout News 25 April 2014 - Goodbye MapReduce
Apache Mahout the original Machine Learning (ML)
library for Hadoop since 2009 is rejecting new
MapReduce algorithm
implementationshttpmahoutapacheorg
bull Integration of Mahout and Spark
bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this
DSL are automatically optimized and executed in
parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for
Spark optimized Mahout DSLhttpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
40
(Expected in Mahout 10 )
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov
April 2014
httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with
Mahout Scala and Spark Published on May 30 2014
httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-
with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)-
MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 IntegrationService Open Source Tool
StorageServi
ng Layer
Data Formats
Data
Ingestion
Services
Resource
Management
Search
SQL
43
3 Integration
bull Spark was designed to read and write data from and toHDFS as well as other storage systems supported byHadoop API such as your local file system Hive HBaseCassandra and Amazonrsquos S3
bull Stronger integration between Spark and HDFS caching(SPARK-1767) to allow multiple tenants and processingframeworks to share the same in-memoryhttpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM Discardable Distributed Memoryhttphortonworkscomblogddm to store RDDs in memoryThisallows many Spark applications to share RDDs since theyare now resident outside the address space of theapplication Related HDFS-5851 is planned for Hadoop30 httpsissuesapacheorgjirabrowseHDFS-5851
44
3 Integration
bull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via
newAPIHadoopRDD Example HBaseTestscala from
Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapach
esparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available
for reading from and writing to HBase without the need
of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with
Spark Status Still in experimentation and no timetable for
possible support httpblogclouderacomblog201412new-in-cloudera-
labs-sparkonhbase
45
3 Integration
bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark
RDDs to Cassandra tables and execute arbitrary CQL
queries in your Spark applications Supports also
integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration
is not based on the Cassandras Hadoop interfacehttpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag20-cassandra
46
3 Integration
bull Benchmark of Spark amp Cassandra Integration
using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume
data from Cassandra to spark and store Resilient
Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new
avenues
bull Kindling An Introduction to Spark with Cassandra
(Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-
spark-with-cassandra
47
3 Integration
bull MongoDB is not directly served by Spark although
it can be used from Spark via an official Mongo-
Hadoop connector
bull MongoDB-Spark Demohttpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-
insights
bull Spark SQL also provides indirect support via its
support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
48
3 Integration
bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from
Apache Spark (still experimental)
bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-
introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-
example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-
example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without
Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
49
3 Integration
bull Neo4j is a highly scalable robust (fully ACID) native graph
database
bull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015
httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015
httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph
Analytics By Kenny Bastani November 3 2014
httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
50
3 Integration YARN
bull YARN Yet Another Resource Negotiator Implicit
reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND
20summary20~20yarn20AND20status203D20OPEN20ORDER20
BY20priority20DESC0A
bull Some issues are critical ones
bull Running Spark on YARNhttpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
51
3 Integration
bull Spark SQL provides built in support for Hivetables
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 013 is supported in Spark 120
bull Support of ORCFile (Optimized Row Columnarfile) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries andfor fetching dataset machine learning algorithmsin MLlib
52
3 Integration
bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to
address new use cases
bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query
in-memory data in Spark Embed Drill execution in a
Spark data pipeline
Source Whats Coming in 2015 for
Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
53
3 Integration
bull Apache Kafka is a high throughput distributed
messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka
Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming
Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-
example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka
54
3 Integration
bull Apache Flume is a streaming event data
ingestion system that is designed for Big Data
ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with
Flume There are two approaches to this
bull Approach 1 Flume-style Push-based Approach
bull Approach 2 (Experimental) Pull-based
Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
55
3 Integration
bull Spark SQL provides built in support for JSON that
is vastly simplifying the end-to-end-experience of
working with JSON data
bull Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD No more DDL Just point Spark
SQL to JSON files and query Starting Spark 13
SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-
support-in-spark-sqlhtml
56
3 Integration
bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows to
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
57
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
bull Problem
bull Various inbound data sets
bull Data Layout can change without notice
bull New data sets can be added without notice
Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
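From PySpark, the elasticsearch-hadoop integration above is reached through Hadoop OutputFormats rather than a dedicated Python API. A hedged sketch, assuming the elasticsearch-hadoop jar is on Spark's classpath; the index/type resource and node address are placeholders:

```python
def save_to_es(rdd, resource="myindex/mytype", es_nodes="localhost:9200"):
    # elasticsearch-hadoop settings: target index/type and cluster nodes.
    conf = {
        "es.resource": resource,
        "es.nodes": es_nodes,
    }
    # Each RDD element must be translatable into a document; elasticsearch-hadoop
    # exposes an OutputFormat that performs the translation and the bulk writes.
    rdd.saveAsNewAPIHadoopFile(
        path="-",  # unused by EsOutputFormat, but required by the API
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=conf,
    )
```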
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem | Spark ecosystem
4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
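Point 2 above (Kafka in, processed stream out, no persistence layer) can be sketched with the Python Kafka bindings added in Spark 1.3. A hedged sketch; the ZooKeeper quorum, consumer group, and topic names are placeholders, and the import is deferred so the snippet parses without Spark:

```python
def kafka_no_hdfs(zk_quorum="localhost:2181", topic="events"):
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="KafkaNoHDFS")
    ssc = StreamingContext(sc, 2)  # 2-second micro-batches
    # (key, message) pairs consumed from Kafka; nothing ever touches HDFS.
    stream = KafkaUtils.createStream(ssc, zk_quorum, "spark-group", {topic: 1})
    stream.map(lambda kv: kv[1].upper()).pprint()  # transform and emit
    ssc.start()
    ssc.awaitTermination()
```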
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
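From the application's point of view, most of these deployment options differ mainly in the master URL handed to Spark. A sketch (the host names are placeholders, and the pyspark import is deferred so the snippet parses without a Spark installation); the master-URL strings themselves are standard Spark syntax:

```python
def make_conf(mode):
    from pyspark import SparkConf

    masters = {
        "local": "local[4]",                # 1: local mode, 4 worker threads
        "standalone": "spark://host:7077",  # 2: Spark standalone cluster
        "mesos": "mesos://host:5050",       # 3: Apache Mesos
        "yarn": "yarn-client",              # YARN (e.g. on an EMR cluster)
    }
    return SparkConf().setAppName("demo").setMaster(masters[mode])
```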
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a Non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE, DataStax Enterprise built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Hadoop ecosystem → Spark ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
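Because Tachyon exposes a Hadoop-compatible file-system API, an existing Spark job switches to it by changing only the URI scheme, as the "without any code change" bullet says. A sketch with hypothetical paths and a caller-supplied SparkContext:

```python
def copy_through_tachyon(sc):
    # Reading: only the scheme differs from an hdfs:// path (hypothetical host/path).
    rdd = sc.textFile("tachyon://master:19998/input/data.txt")
    counts = (rdd.flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))
    # Writing back goes through Tachyon at memory speed.
    counts.saveAsTextFile("tachyon://master:19998/output/counts")
```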
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
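The conciseness claim is easiest to see in the canonical word count, written against Spark's Python API. A sketch, assuming a Spark installation and a caller-supplied SparkContext; the input path is hypothetical:

```python
def word_count(sc, path="hdfs:///input/file.txt"):
    return (sc.textFile(path)
              .flatMap(lambda line: line.split())  # one element per word
              .map(lambda word: (word, 1))         # pair each word with a count of 1
              .reduceByKey(lambda a, b: a + b)     # sum the counts per word
              .collect())
```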
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
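The "mix and match SQL with imperative APIs" point can be sketched as follows. A hedged sketch against the Spark 1.2-era API, where a HiveContext query returns a SchemaRDD that is itself an RDD; the table and column names are hypothetical, and the import is deferred so the snippet parses without Spark:

```python
def longest_sessions(sc):
    from pyspark.sql import HiveContext

    hive = HiveContext(sc)  # gives access to the Hive metastore and Hive UDFs
    # Declarative part: a Hive-compatible SQL query...
    rows = hive.sql("SELECT user, duration FROM sessions WHERE duration > 60")
    # ...mixed with imperative RDD transformations on the same result
    # (a SchemaRDD is itself an RDD in Spark 1.2).
    return (rows.map(lambda r: (r.user, r.duration))
                .reduceByKey(lambda a, b: max(a, b))
                .take(10))
```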
Spark MLlib
93
• 'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
• 'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
95
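The "mini batches" row is the crux of the comparison: Spark Streaming discretizes the stream into small batches and runs a batch job on each, while Storm invokes processing once per record. A toy, dependency-free simulation of that discretization (plain Python, no Spark required):

```python
def discretize(records, batch_size):
    """Group an incoming record stream into mini-batches, DStream-style."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch  # each mini-batch is processed as one small batch job
            batch = []
    if batch:
        yield batch      # flush the final partial batch

batches = list(discretize(range(7), batch_size=3))
# A record-at-a-time engine like Storm would instead invoke processing 7 times.
```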
GraphX
96
• 'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring Your Own Storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence, based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
Apache Spark Survey 2015 by Typesafe - Quick Snapshot
7
3 Vendors
8
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• Uniform API for diverse workloads over diverse storage systems and runtimes. Source: Slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments." Source: Slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
3 Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
9
3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
10
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
11
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
12
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014: http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
13
5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly.
2 Surveys: Listen to what Spark developers are saying.
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4 Analysts: Thorough understanding of the market dynamics.
14
II Big Data: Typical Big Data Stack, Apache Hadoop, Apache Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1 Big Data
• Big Data is still one of the most inflated buzzwords of the last years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3 Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. Incomplete, but a useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4 Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned!
19
5 Key Takeaways
1 Big Data: Still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: Emergence of the Apache Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
23
• 1st Generation: Batch
• 2nd Generation: Batch, Interactive
• 3rd Generation: Batch, Interactive, Near-Real time
• 4th Generation: Batch, Interactive, Real-Time, Iterative
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of Programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics.
24
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark
Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model | On-Disk (disk-based parallelization); Batch | On-Disk; Batch, Interactive | In-Memory, On-Disk; Batch, Interactive, Streaming (Near Real-Time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility to data types and data sources | Same | Same | Same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | | | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy
30
Partial support
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark, from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
32
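Point 1 above can be illustrated without a cluster: a mapper and a reducer written for MapReduce-style word count are reused unchanged by a Spark-style pipeline. Pure Python stands in for the Spark calls (noted in the comment) so the sketch is runnable as-is:

```python
def mapper(line):
    # Original MapReduce map(): emit (word, 1) pairs for one input line.
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    # Original MapReduce reduce(): combine two counts for the same key.
    return a + b

def run_in_spark_style(lines):
    # With pyspark this would be:
    #   sc.parallelize(lines).flatMap(mapper).reduceByKey(reducer).collect()
    counts = {}
    for word, one in (pair for line in lines for pair in mapper(line)):
        counts[word] = reducer(counts.get(word, 0), one)
    return counts
```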
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in Beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in Beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark
(Expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration: services and their open source tools
(figure: Hadoop ecosystem service layers integrating with Spark)
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files
48
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator)
• Integration is still improving, and some of the open YARN-related Spark issues are critical ones (see the Spark JIRA, filtered on "yarn")
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
bull Apache Kafka is a high throughput distributed
messaging system httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
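The schema-inference idea above can be illustrated without Spark. This is a hedged, stdlib-only toy: it scans JSON records and records the set of types seen per field, which is the gist (not the implementation) of what Spark SQL does when it loads JSON without a DDL:

```python
import json

# Toy schema inference in the spirit of Spark SQL's JSON support:
# scan every record and collect the set of types seen for each field.
def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

lines = [
    '{"name": "spark", "stars": 4500}',
    '{"name": "hadoop", "stars": 3800, "retired": false}',
]
schema = infer_schema(lines)
# schema == {"name": {"str"}, "stars": {"int"}, "retired": {"bool"}}
```

Note how a field that appears in only some records ("retired") still ends up in the schema, which is why schema-on-read works for heterogeneous JSON.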
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
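Why a columnar format helps analytical queries can be shown with a hedged, plain-Python sketch (this models the layout idea only, not Parquet's actual encoding, compression, or file format):

```python
# Row-oriented vs column-oriented layout of the same table, to illustrate
# why a columnar format like Parquet speeds up column projections.
rows = [
    {"tool": "Pig",  "engine": "Spark", "year": 2014},
    {"tool": "Hive", "engine": "Spark", "year": 2015},
]

# Columnar layout: one contiguous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# Projecting a single column touches one list instead of scanning
# every row dict; on disk, that means reading far fewer bytes.
years = columns["year"]
# years == [2014, 2015]
```

The same layout is also what enables per-column compression and encoding, since values of one type sit together.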
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
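The "dynamically split the data" step in that use case can be sketched with a hedged, stdlib-only example: heterogeneous inbound records are grouped by a dataset-type tag so each group can then be written to its own output (Avro, in the slide's scenario). The field names here are illustrative, not from the original talk; in Spark this grouping would be a keyBy/groupByKey or a partitioned write:

```python
from collections import defaultdict

# Group records whose layout can change without notice by a type tag,
# so each dataset type can be written to its own (e.g. Avro) output.
def split_by_type(records):
    buckets = defaultdict(list)
    for record in records:
        buckets[record.get("type", "unknown")].append(record)
    return dict(buckets)

inbound = [
    {"type": "clickstream", "url": "/home"},
    {"type": "billing", "amount": 9.99},
    {"type": "clickstream", "url": "/docs"},
]
buckets = split_by_type(inbound)
# buckets has 2 clickstream records and 1 billing record
```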
3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4 Complementarity +
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group:
http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Hadoop ecosystem | Spark ecosystem
Components:
HDFS             | Tachyon
YARN             | Mesos
Tools:
Pig              | Spark native API
Hive             | Spark SQL
Mahout           | MLlib
Storm            | Spark Streaming
Giraph           | GraphX
HUE              | Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, ...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos
Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and Memory
Running tasks    | Unix processes                            | Linux Container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
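The functional chaining style shared by the Scala, Java 8, and Python APIs can be shown over a plain list. This is a hedged sketch: a real pipeline would call the same-named methods on an RDD (rdd.filter(...).map(...).collect()), whereas here Python's built-in filter/map stand in:

```python
# Extract error messages from log lines, in the chained lambda style
# that Spark's APIs share across Scala, Java 8, and Python.
lines = ["ERROR disk full", "INFO ok", "ERROR timeout"]

errors = list(
    map(lambda line: line.split(" ", 1)[1],            # keep only the message
        filter(lambda line: line.startswith("ERROR"),  # keep only error lines
               lines))
)
# errors == ["disk full", "timeout"]
```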
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming
Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
95
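The "record at a time vs mini batches" distinction in the table above can be sketched with a hedged, stdlib-only example: Spark Streaming chops a stream into small batches and runs an ordinary batch computation on each one, which is what makes Core Spark API integration and exactly-once semantics natural:

```python
# Micro-batching, Spark Streaming-style: turn a stream of records into
# fixed-size batches, then apply a normal batch computation per batch.
def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# The per-interval batch computation; here just a count per batch.
stream = iter(range(7))
counts = [len(b) for b in micro_batches(stream, 3)]
# counts == [3, 3, 1]
```

A record-at-a-time system like Storm would instead invoke processing once per record as it arrives, trading batch-engine reuse for lower latency.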
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
Apache Spark Survey 2015 by
Typesafe - Quick Snapshot
7
3 Vendors
8
bull Spark and Hadoop Working Together January 21
2014 by Ion Stoica httpsdatabrickscomblog20140121spark-and-
hadoophtml
bull Uniform API for diverse workloads over diverse
storage systems and runtimes
Source Slide 16 of lsquoSparks Role in the Big Data Ecosystem (Spark
Summit 2014) November 2014 Matei
Zahariahttpwwwslidesharenetdatabricksspark-summit2014
bull The goal of Apache Spark is to have one engine for all
data sources workloads and environmentsrdquo
Source Slide 15 of lsquoNew Directions for Apache Spark in 2015
February 20 2015 Strata + Hadoop Summit Matei Zaharia
httpwwwslidesharenetdatabricksnew-directions-for-apache-spark-in-2015
3 Vendorsbull ldquoSpark is already an excellent piece of software and is
advancing very quickly No vendor mdash no new project mdashis likely to catch up Chasing Spark would be a wasteof time and would delay availability of real-time analyticand processing services for no good reason rdquoSource MapReduce and Spark December 302013 httpvisionclouderacommapreduce-spark
bull ldquoApache Spark is an open source parallel dataprocessing framework that complements ApacheHadoop to make it easy to develop fast unified Big Dataapplications combining batch streaming and interactiveanalytics on all your datardquohttpwwwclouderacomcontentclouderaenproducts-and-servicescdhsparkhtml
9
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale
data processing. Spark supports rapid application
development for big data and allows for code reuse
across batch, interactive and streaming applications.
Spark also provides advanced execution graphs with in-memory
pipelining to speed up end-to-end application
performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its
Distribution for Hadoop
https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
10
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014
https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014
http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
11
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing
existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no
reason to think solution stacks built on Spark, not
positioned as Hadoop, will not continue to proliferate
as the technology matures.
• At the same time, Hadoop distributions are all
embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum,
February 25, 2015 http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
12
4. Analysts
• "After hearing the confusion between Spark and
Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
13
5. Key Takeaways
1. News: Big Data is no longer a Hadoop
monopoly.
2. Surveys: Listen to what Spark developers are
saying.
3. Vendors: <Hadoop Vendor>-tinted goggles?
FUD is still being 'offered' by some Hadoop
vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the
market dynamics.
14
II. Big Data, Typical Big Data
Stack, Hadoop, Spark
1. Big Data
2. Typical Big Data Stack
3. Apache Hadoop
4. Apache Spark
5. Key Takeaways
15
1. Big Data
• Big Data is still one of the most inflated buzzwords of
recent years.
• Big Data is a broad term for data sets so large or
complex that traditional data processing tools are
inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above
definition is inadequate.
• "Big Data refers to datasets and flows large enough
that they have outpaced our capability to store, process,
analyze, and understand." Amir H. Payberah,
Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data
Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools
(either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name
An incomplete but useful list of Big Data related projects
packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future – Amr
Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the
video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture
representing the evolution of Apache Hadoop.
https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial ones. I'm compiling a list. Stay tuned!
19
5. Key Takeaways
1. Big Data: Still one of the most inflated
buzzwords.
2. Typical Big Data Stack: Big Data stacks look
similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer
'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache
Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big
Data! http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi: a Scala productivity framework for Hadoop. https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding: a Scala API for Cascading. http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007,
MapReduce v1 was the only choice as a compute model
(execution engine) on Hadoop. Now, in addition
to MapReduce v2, we have Tez, Spark and Flink.
23
(Figure: evolution of compute models)
• 1st Generation (MapReduce): Batch
• 2nd Generation (Tez): Batch, Interactive
• 3rd Generation (Spark): Batch, Interactive, Near-Real-Time
• 4th Generation (Flink): Batch, Interactive, Real-Time, Iterative
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see slide on evolution of Programming APIs). User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."
Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for
building high performance batch and
interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1. Evolution
• 'Spark' for lightning fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink http://flink.apache.org offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark
Criteria: MapReduce | Tez | Spark
• License: Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
• Processing Model: On-Disk (disk-based parallelization), Batch | On-Disk, Batch, Interactive | In-Memory and On-Disk; Batch, Interactive, Streaming (Near Real-Time)
• Language written in: Java | Java | Scala
• API: [Java, Python, Scala], User-Facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], User-Facing
• Libraries: None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria: MapReduce | Tez | Spark
• Installation: Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
• Ease of Use: Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
• Compatibility to data types and data sources: same | same | same
• YARN integration: YARN application | Ground-up YARN application | Spark is moving towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria: MapReduce | Tez | Spark
• Deployment: YARN | YARN | [Standalone, YARN, SIMR, Mesos, ...]
• Performance: – | – | Good performance when data fits into memory; performance degradation otherwise
• Security: More features and projects | More features and projects | Still in its infancy
* Partial support
30
III. Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2. Transition
• Existing Hadoop MapReduce projects can
migrate to Spark and leverage Spark Core as
execution engine:
1. You can often reuse your mapper and
reducer functions and just call them in
Spark from Java or Scala.
2. You can translate your code from
MapReduce to Apache Spark. How-to:
Translate from MapReduce to Apache Spark. http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
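As a hedged illustration of point 2, the canonical MapReduce word count collapses to a few RDD transformations in Spark (Scala); the input and output paths here are hypothetical, and the job assumes a Spark 1.x cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountMigration {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    // The map and reduce phases of the MR job become RDD transformations
    val counts = sc.textFile("hdfs:///input/docs")   // input path: hypothetical
      .flatMap(_.split("\\s+"))                      // ~ Mapper: emit one record per word
      .map(word => (word, 1))
      .reduceByKey(_ + _)                            // ~ Reducer: sum counts per key
    counts.saveAsTextFile("hdfs:///output/wordcounts")
    sc.stop()
  }
}
```

The shuffle between `map` and `reduceByKey` plays the role of the MR sort-and-shuffle phase, so existing reducer logic usually ports over as the combining function.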
32
2. Transition
3. The following tools, originally based on Hadoop
MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration
without development effort.
• Speed up your existing Pig scripts on Spark (Query,
Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as
Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella JIRA (Status: passed end-to-end test
cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality
through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
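The migration path above is a one-flag change at the command line; a minimal sketch (the script name is hypothetical, and "-x spark" requires a Pig build with Spork support):

```shell
# Run an existing Pig script unchanged, on the Spark execution engine
pig -x spark wordcount.pig

# For comparison: the same script on classic MapReduce
pig -x mapreduce wordcount.pig
```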
34
Hive on Spark (currently in beta,
expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on
MapReduce or Tez easily migrate to Spark without
development effort.
• Exposes Spark users to a viable, feature-rich, de facto
standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries
involving multiple reducer stages.
• Hive on Spark Umbrella JIRA (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
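In practice the switch is a per-session setting, as in this hedged sketch of a Hive CLI session (the table name is hypothetical, and the engine value requires a Hive build with Spark support):

```sql
-- Choose Spark as the execution engine for this session
set hive.execution.engine=spark;
-- Existing HiveQL runs unchanged; multi-stage aggregations benefit most
SELECT dept, COUNT(*) FROM employees GROUP BY dept;
-- Switch back to the default MapReduce engine
set hive.execution.engine=mr;
```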
35
Hive on Spark (currently in beta,
expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark:+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho,
Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and
Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark
(expected in Sqoop 2)
• Sqoop (aka 'from SQL to Hadoop') was initially
developed as a tool to transfer data from RDBMS to
Hadoop.
• The next version of Sqoop, referred to as Sqoop2,
supports data transfer across any two data sources.
• Sqoop 2 Proposal is still under
discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA
Status: Work In Progress). The goal of this ticket is to support a
pluggable way to select the execution engine on which we can run
the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in 3.1 release)
• Cascading http://www.cascading.org is an application
development platform for building data applications on
Hadoop.
• Support for Apache Spark is on the roadmap and will be
available in the Cascading 3.1 release.
Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the
transition from Cascading/Scalding to Spark a little
easier, by adding support for Cascading Taps, Scalding
Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a
framework for writing, testing and running
MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a
SparkPipeline class, making it easy to migrate
data processing applications from MapReduce
to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Apache Mahout (expected in Mahout 1.0)
• Mahout News, 25 April 2014 – Goodbye MapReduce:
Apache Mahout, the original Machine Learning (ML)
library for Hadoop since 2009, is rejecting new
MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed
Machine Learning on Spark. Programs written in this
DSL are automatically optimized and executed in
parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the
Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
Apache Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov,
April 2014
http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with
Mahout, Scala and Spark, published on May 30, 2014
http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased):
MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3. Integration
(Diagram: Hadoop ecosystem services and the open source tools Spark integrates with – Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL)
43
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory) http://hortonworks.com/blog/ddm/ to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
44
3. Integration
• Out of the box, Spark can interface with HBase, as it has
full support for Hadoop InputFormats via
newAPIHadoopRDD. Example: HBaseTest.scala from the
Spark code. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available
for reading from and writing to HBase without the need
of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with
Spark. Status: still in experimentation, and no timetable for
possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
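The newAPIHadoopRDD route mentioned above can be sketched as follows (a hedged example in the spirit of HBaseTest.scala; the table name is hypothetical, and an HBase cluster plus client jars are assumed on the classpath):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseCount"))
    // Point the Hadoop InputFormat at an HBase table ("webtable" is hypothetical)
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "webtable")
    // Expose the table as an RDD of (row key, row result) pairs
    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(s"Rows: ${hBaseRDD.count()}")
    sc.stop()
  }
}
```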
45
3. Integration
• Spark Cassandra Connector: this library lets you
expose Cassandra tables as Spark RDDs, write Spark
RDDs to Cassandra tables, and execute arbitrary CQL
queries in your Spark applications. It also supports
integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration
is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
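A minimal sketch of the Spark Cassandra Connector usage described above (keyspace, table and host are hypothetical; assumes the connector jar and a reachable Cassandra cluster):

```scala
import com.datastax.spark.connector._   // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

object CassandraExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraExample")
      .set("spark.cassandra.connection.host", "127.0.0.1") // cluster address: assumption
    val sc = new SparkContext(conf)
    // Expose a Cassandra table as an RDD (keyspace/table names are hypothetical)
    val users = sc.cassandraTable("test_ks", "users")
    println(s"Users: ${users.count()}")
    // Write an RDD back to another table, mapping tuple fields to named columns
    sc.parallelize(Seq(("alice", 30), ("bob", 25)))
      .saveToCassandra("test_ks", "ages", SomeColumns("name", "age"))
    sc.stop()
  }
}
```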
46
3. Integration
• Benchmark of Spark & Cassandra integration
using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume
data from Cassandra to Spark and store Resilient
Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new
avenues.
• Kindling: An Introduction to Spark with Cassandra
(Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3. Integration
• MongoDB is not directly served by Spark, although
it can be used from Spark via the official Mongo-Hadoop
connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its
support for reading and writing JSON text files.
48
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector,
for reading and writing MongoDB collections directly from
Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup (Part 1)
• http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example (Part 2)
• http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways (Part 3)
• Interesting blog on using Spark with MongoDB without
Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph
database.
• Getting Started with Apache Spark and Neo4j Using
Docker Compose, by Kenny Bastani, March 10, 2015
http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015
http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph
Analytics, by Kenny Bastani, November 3, 2014
http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit
reference to Mesos as the resource negotiator).
• Integration is still improving. JIRA query: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC (https://issues.apache.org/jira/issues/)
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883. https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
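The Hive-table support above goes through HiveContext; a hedged sketch for the Spark 1.2 era (the table name "src" is hypothetical, and a hive-site.xml on the classpath is assumed):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveTableQuery {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveTableQuery"))
    // HiveContext picks up the Hive metastore configuration if present
    val hiveContext = new HiveContext(sc)
    // Query an existing Hive table ("src" is hypothetical) as an RDD of Rows
    val rows = hiveContext.sql("SELECT key, value FROM src WHERE key < 10")
    rows.collect().foreach(println)
    sc.stop()
  }
}
```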
52
3. Integration
• Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to
address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query
in-memory data in Spark; embed Drill execution in a
Spark data pipeline.
Source: What's Coming in 2015 for
Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3. Integration
• Apache Kafka is a high-throughput distributed
messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka:
Spark Streaming + Kafka Integration Guide. http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming:
Code Examples and State of the Game. http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
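The native integration above is a one-liner in the receiver-based API of the Spark 1.2 era; a hedged sketch (ZooKeeper address, consumer group and topic name are all hypothetical):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches
    // Receiver-based stream; createStream yields (key, message) pairs
    val lines = KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group",
      Map("pageviews" -> 1)).map(_._2)
    // Running word count over each batch
    lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```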
54
3. Integration
• Apache Flume is a streaming event data
ingestion system designed for the Big Data
ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with
Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based
approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that
is vastly simplifying the end-to-end experience of
working with JSON data.
• Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD. No more DDL, just point Spark
SQL to JSON files and query. Starting with Spark 1.3,
SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files. http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
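The three bullets above amount to a round trip; a hedged Spark 1.x sketch (all paths are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetRoundTrip"))
    val sqlContext = new SQLContext(sc)
    // Load some structured data, then persist it columnar as Parquet
    val people = sqlContext.jsonFile("hdfs:///data/people.json") // path: hypothetical
    people.saveAsParquetFile("hdfs:///data/people.parquet")
    // Read the Parquet files back; the schema travels with the data
    val parquetPeople = sqlContext.parquetFile("hdfs:///data/people.parquet")
    parquetPeople.registerTempTable("people")
    sqlContext.sql("SELECT COUNT(*) FROM people").collect().foreach(println)
    sc.stop()
  }
}
```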
57
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to
work with datasets on Hadoop, hiding many of
the details of compression codecs, file formats,
partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16
release, so Spark jobs can read and write to Kite
datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics
engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark, in the form of an RDD that can
read data from Elasticsearch. Also, any RDD can be saved to
Elasticsearch, as long as its content can be translated into
documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache
Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for
fast and easy indexing, ingestion, and serving of
searchable complex data: "CrunchIndexerTool on
Spark".
• Solr-on-Spark solution using Apache Solr, Spark,
Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark
Igniter, lets users execute and monitor Spark jobs
directly from their browser and be more
productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem
can work together, each for what it is especially good at,
rather than choosing one of them.
64
Hadoop ecosystem | Spark ecosystem
4. Complementarity (Spark + Tachyon)
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity (Mesos + YARN)
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity (Mesos + YARN)
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN
cluster. https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache
Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get
Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity (Spark + Tez)
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution, statistics, or... HDFS caching).
• The Spark execution layer could be leveraged without the
need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with
YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity (Spark + Tez)
• Data >> RAM: when processing huge data volumes,
much bigger than cluster RAM, Tez might be better,
since it is more "stream oriented", has a more mature
shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in
memory, it can be much better when we process
data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native
YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer:
a smart execution engine dynamically selects the optimal
compute framework at each step in the big data
analytics process, based on the type of platform, the
attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine
Interview on November 13, 2014 with Matt Schumpert, Director of Product
Management at Datameer.
• The Challenge to Choosing the "Right" Execution
Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by
Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles
Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption, February 12, 2015
http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms,
February 23, 2015
http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015
http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: of compute models is still ongoing.
Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3
• http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS
• https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
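To make point 4 concrete, reading straight from S3 looks like any other Hadoop-supported URI; a hedged Spark 1.x sketch (the bucket name is hypothetical, and credentials are assumed in environment variables):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object S3LineCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("S3LineCount"))
    // s3n:// URIs go through the Hadoop file-system API; no HDFS involved
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
    val logs = sc.textFile("s3n://my-bucket/logs/*.gz") // bucket: hypothetical
    println(s"Lines: ${logs.count()}")
    sc.stop()
  }
}
```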
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS
HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives
Hadoop Ecosystem → Spark Ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
90
Spark Native API
• Spark's native API is available in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
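The concise functional style the native API encourages can be sketched outside Spark with plain Python. This is a hypothetical, stdlib-only analogue of a flatMap → map → reduceByKey word count, not Spark's actual API (which would be `sc.textFile(...).flatMap(...)` and so on):

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to see or not to see"]

# Spark-style pipeline, simulated locally:
words = chain.from_iterable(line.split() for line in lines)  # flatMap
counts = Counter(words)                                      # map + reduceByKey

print(counts["to"])  # 4
```

The same logic in Hadoop MapReduce requires a Mapper class, a Reducer class, and a job driver; the functional API collapses it into a couple of expressions.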
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
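The "mix and match SQL and imperative code" idea can be sketched without a Spark installation, using Python's stdlib sqlite3 as a stand-in SQL engine. This is a hypothetical illustration of the pattern, not Spark SQL itself:

```python
import sqlite3

# Run a SQL aggregation, then post-process the rows imperatively,
# mirroring how Spark SQL lets a query feed further programmatic analysis.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10.0), ("a", 5.0), ("b", 7.5)])

rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative follow-up on the SQL result
top = max(rows, key=lambda r: r[1])
print(top)  # ('a', 15.0)
```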
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini-batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
95
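The two processing models in the comparison above can be sketched with a hypothetical, stdlib-only Python snippet (record-at-a-time vs. mini-batches; neither Storm's nor Spark Streaming's real API):

```python
stream = list(range(10))

# Record-at-a-time (Storm-style): handle each event as it arrives.
per_record = [x * 2 for x in stream]

# Mini-batches (Spark Streaming-style): group events into small batches,
# then process each batch with the same batch-oriented API.
batch_size = 4
batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]
per_batch = [sum(b) for b in batches]

print(batches)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(per_batch)  # [6, 22, 17]
```

Batching is what lets Spark Streaming reuse the core batch API, at the cost of a few seconds of latency per batch.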
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
Apache Spark Survey 2015 by
Typesafe - Quick Snapshot
7
3 Vendors
8
• Spark and Hadoop: Working Together, January 21, 2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• Uniform API for diverse workloads over diverse storage systems and runtimes.
Source: Slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark Summit 2014), November 2014, Matei Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all data sources, workloads and environments."
Source: Slide 15 of 'New Directions for Apache Spark in 2015', February 20, 2015, Strata + Hadoop Summit, Matei Zaharia: http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
3. Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor, and no new project, is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
9
3. Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
10
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
11
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening.
• With uncertainty about "what is Hadoop", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
12
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
13
5. Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
14
II Big Data Typical Big Data
Stack Hadoop Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1. Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and commercial ones. I'm compiling a list. Stay tuned!
19
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
23
• 1st generation (MapReduce): Batch
• 2nd generation (Tez): Batch, Interactive
• 3rd generation (Spark): Batch, Interactive, Near-Real-Time
• 4th generation (Flink): Batch, Interactive, Real-Time, Iterative
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User-Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
24
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."
Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1 Evolution
• 'Spark': for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, and near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
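Why in-memory RDD processing matters for iterative workloads can be sketched with a hypothetical, stdlib-only Python analogue of `rdd.cache()`: pay the load cost once, then reuse the in-memory data across iterations.

```python
import time

def expensive_load():
    """Stand-in for a disk or network read of the input dataset."""
    time.sleep(0.01)
    return list(range(1000))

cached = expensive_load()   # cache() analogue: materialize once, keep in memory

total = 0
for _ in range(10):         # ten "iterations" reuse the cached data directly
    total += sum(cached)
print(total)  # 4995000
```

In MapReduce, each of those ten iterations would re-read the input from disk; keeping the working set in memory is what makes Spark attractive for iterative machine learning and graph algorithms.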
26
1 Evolution Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive | In-memory and on-disk; Batch, Interactive, Streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
Compatibility | Compatibility to data types and data sources is the same for all three
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
29
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | – | – | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy (partial support)
30
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
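Point 1 above (reusing existing mapper/reducer functions) can be sketched with a hypothetical, stdlib-only Python snippet; the function names are illustrative, not from any real codebase:

```python
from functools import reduce

# Existing Hadoop-style per-record and pairwise-combine logic,
# reused unchanged inside a functional Spark-like pipeline.
def mapper(word):
    return (word, 1)

def reducer(a, b):
    return a + b

words = ["spark", "hadoop", "spark"]
pairs = map(mapper, words)                    # reuse the mapper as-is
spark_counts = [v for k, v in pairs if k == "spark"]
total = reduce(reducer, spark_counts)         # reuse the reducer as-is
print(total)  # 2
```

In real Spark, the same reuse looks like `rdd.map(mapper).reduceByKey(reducer)`, with the original functions plugged in directly.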
32
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release.
Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3. Integration
Services and their open source tools:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID), native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open YARN-related Spark issues in the Apache JIRA.
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3. Integration: Kafka
• Apache Kafka is a high-throughput distributed
messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka:
Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming:
Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
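The native Kafka integration can be sketched with the Spark 1.2-era receiver API. ZooKeeper host, consumer group, and topic name are placeholders:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))

// Subscribe to a (hypothetical) topic via ZooKeeper; 1 receiver thread
val lines = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-consumer-group", Map("events" -> 1)
).map(_._2)  // drop the Kafka message key, keep the value

lines.count().print()
ssc.start()
ssc.awaitTermination()
```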
3. Integration: Flume
• Apache Flume is a streaming event data
ingestion system designed for the Big Data
ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with
Flume. There are two approaches to this:
 • Approach 1: Flume-style Push-based Approach
 • Approach 2 (Experimental): Pull-based
Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
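Approach 1 (push-based) can be sketched as follows; Flume would be configured with an Avro sink pointing at the receiver's host and port, both placeholders here:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val ssc = new StreamingContext(sc, Seconds(5))

// Spark Streaming acts as an Avro agent that Flume pushes events to
val flumeStream = FlumeUtils.createStream(ssc, "receiver-host", 4141)

// Each event carries headers and a body (raw bytes)
flumeStream.map(e => new String(e.event.getBody.array())).print()

ssc.start()
ssc.awaitTermination()
```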
3. Integration: JSON
• Spark SQL provides built-in support for JSON that
vastly simplifies the end-to-end experience of
working with JSON data
• Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD. No more DDL: just point Spark
SQL to JSON files and query. Starting with Spark 1.3,
SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
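The schema-inference flow above can be sketched with the Spark 1.2 API (the file path and field names are placeholders; in 1.3 SchemaRDD becomes DataFrame):

```scala
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)

// Point Spark SQL at JSON files; the schema is inferred automatically
val people = sqlCtx.jsonFile("hdfs:///data/people.json")
people.printSchema()   // inspect the inferred schema; no DDL needed

people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age >= 18").collect()
```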
3. Integration: Parquet
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
 • Import relational data from Parquet files
 • Run SQL queries over imported data
 • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
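The three Parquet capabilities listed above can be sketched with the Spark 1.2-era API; paths and table names are hypothetical:

```scala
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)

// Import relational data from Parquet files
val events = sqlCtx.parquetFile("hdfs:///data/events.parquet")
events.registerTempTable("events")

// Run SQL over the imported data...
val top = sqlCtx.sql("SELECT user, COUNT(*) AS c FROM events GROUP BY user")

// ...and easily write results back out as Parquet
top.saveAsParquetFile("hdfs:///data/top_users.parquet")
```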
3. Integration: Avro
• Spark SQL Avro Library: for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
 • Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
 • Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to
work with datasets on Hadoop, hiding many of
the details of compression codecs, file formats,
partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16
release, so Spark jobs can read and write to Kite
datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration: Elasticsearch
• Elasticsearch is a real-time distributed search and analytics
engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark in the form of an RDD that can
read data from Elasticsearch. Also, any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
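The elasticsearch-hadoop RDD integration above can be sketched as follows; the index/type names and query are hypothetical, and the elasticsearch-hadoop jar must be on the classpath:

```scala
import org.elasticsearch.spark._  // adds esRDD / saveToEs to SparkContext and RDDs

// Read: query a (hypothetical) Elasticsearch index as an RDD of (id, document) pairs
val errorDocs = sc.esRDD("logs/events", "?q=status:error")

// Write: any RDD whose elements can be translated into documents can be saved
val metrics = sc.makeRDD(Seq(Map("host" -> "web1", "errors" -> 3)))
metrics.saveToEs("metrics/daily")
```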
3. Integration: Solr
• Apache Solr added a Spark-based indexing tool for
fast and easy indexing, ingestion, and serving of
searchable complex data: "CrunchIndexerTool on
Spark"
• Solr-on-Spark solution using Apache Solr, Spark,
Crunch, and Morphlines:
 • Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
 • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration: HUE
• HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem
can work together, each for what it is especially good at,
rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4. Complementarity: HDFS + Tachyon + Spark
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity: Mesos + YARN
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN
cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache
Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get
Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the
need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with
YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4. Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes,
much bigger than cluster RAM, Tez might be better,
since it is more "stream oriented," has a more mature
shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data
in memory, it can be much better when we process
data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native
YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer:
a Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process, based on the type of platform, the
attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on
November 13, 2014 with Matt Schumpert, Director of Product
Management at Datameer)
• The Challenge to Choosing the "Right" Execution
Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by
Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles
Big Data Users Group:
http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption, February 12, 2015:
http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms,
February 23, 2015:
http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015:
http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing.
Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
 • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
 • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
 • https://spark.apache.org/docs/latest/storage-openstack-swift.html
 • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
 • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
 • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
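The deployment options above differ mainly in the --master URL passed to spark-submit or spark-shell. A sketch for the Spark 1.x CLI; host names and jar names are placeholders:

```shell
spark-submit --master local[4]          my-app.jar   # 1. Local, 4 cores
spark-submit --master spark://host:7077 my-app.jar   # 2. Standalone cluster
spark-submit --master mesos://host:5050 my-app.jar   # 3. Apache Mesos
# 4. EC2: use the spark-ec2 script shipped with Spark to launch a cluster
./ec2/spark-ec2 -k mykey -i mykey.pem -s 3 launch my-cluster
```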
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Databricks Cloud
• Databricks Cloud is not dependent on
Hadoop. It gets its data from Amazon's S3
(most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and
data products in an instant, March 4, 2015:
https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at
Spark Summit 2014, July 2, 2014:
https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra,
presents itself as a non-Hadoop Big Data platform.
Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with
Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014:
http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector,
Helena Edelson, published on November 24, 2014:
http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers: Infrastructure, Analytics,
and Applications.
• xPatterns is cloud-based, exceedingly scalable,
and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments.
• With EPIC software, you can spin up Hadoop
clusters, with the data and analytical tools that
your data scientists need, in minutes rather than
months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes
streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially
compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives
            Hadoop Ecosystem    Spark Ecosystem
Components: HDFS                Tachyon
            YARN                Mesos
Tools:      Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory
speed across cluster frameworks such as Spark
and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark
and MapReduce programs can run on top of it
without any code change
• Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-
grained sharing, which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution. This leads to considerable performance
improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
 • Share a datacenter between multiple cluster computing
apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including
Apache Spark, Apache Cassandra, Apache YARN,
Apache HDFS...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria          YARN                       Mesos
Resource sharing  Yes                        Yes
Written in        Java                       C++
Scheduling        Memory only                CPU and Memory
Running tasks     Unix processes             Linux Container groups
Requests          Specific requests and      More generic, but more coding
                  locality preference        for writing frameworks
Maturity          Less mature                Relatively more mature
90
Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise
Lambda expressions, to get code nearly as
simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014:
http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
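The interactive shell is where the API's conciseness shows. A short spark-shell session sketch; the file path is a placeholder:

```scala
// In the Scala REPL started by bin/spark-shell ('sc' is pre-created)
val logs = sc.textFile("hdfs:///logs/app.log")

// Chain transformations interactively; nothing runs until an action
val errors = logs.filter(_.contains("ERROR")).cache()

errors.count()                    // action: triggers the computation
errors.take(5).foreach(println)   // inspect a few matching lines
```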
Spark SQL
• Spark SQL is a new SQL engine designed from the
ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains
compatibility with Hive. It supports all existing Hive data
formats, user-defined functions (UDFs), and the Hive
metastore.
• Spark SQL also allows manipulating (semi-)structured
data, as well as ingesting data from sources that
provide schema, such as JSON, Parquet, Hive, or
EDWs. It unifies SQL and sophisticated analysis,
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' Tag at
SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                 Storm                  Spark Streaming
Processing model         Record at a time       Mini batches
Latency                  Sub-second             Few seconds
Fault tolerance (every   At least once (may     Exactly once
record processed)        be duplicates)
Batch framework          Not available          Core Spark API
integration
Supported languages      Any programming        Scala, Java,
                         language               Python
95
GraphX
96
'GraphX' Tag at
SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based
notebook that enables interactive data analytics.
It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based
editor that can combine Scala code, SQL
queries, Markup, or even JavaScript in a
collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for
IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Vendors
8
• Spark and Hadoop: Working Together, January 21,
2014, by Ion Stoica: https://databricks.com/blog/2014/01/21/spark-and-hadoop.html
• Uniform API for diverse workloads over diverse
storage systems and runtimes.
Source: Slide 16 of 'Spark's Role in the Big Data Ecosystem' (Spark
Summit 2014), November 2014, Matei
Zaharia: http://www.slideshare.net/databricks/spark-summit2014
• "The goal of Apache Spark is to have one engine for all
data sources, workloads and environments."
Source: Slide 15 of 'New Directions for Apache Spark in 2015',
February 20, 2015, Strata + Hadoop Summit, Matei Zaharia:
http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015
3 Vendors
• "Spark is already an excellent piece of software and is
advancing very quickly. No vendor, and no new project, is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason."
Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark
• "Apache Spark is an open source parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data."
http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
9
3 Vendors
• "Apache Spark is a general-purpose engine for large-
scale data processing. Spark supports rapid application
development for big data and allows for code reuse
across batch, interactive and streaming applications.
Spark also provides advanced execution graphs with
in-memory pipelining to speed up end-to-end application
performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its
Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
10
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits," October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
11
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing
existing Hadoop practice?
• Both are already happening:
 • With uncertainty about "what is Hadoop?", there is no
reason to think solution stacks built on Spark, not
positioned as Hadoop, will not continue to proliferate
as the technology matures.
 • At the same time, Hadoop distributions are all
embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum,
February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
12
4 Analysts
• "After hearing the confusion between Spark and
Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it."
Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
13
5 Key Takeaways
1. News: Big Data is no longer a Hadoop
monopoly.
2. Surveys: listen to what Spark developers are
saying.
3. Vendors: <Hadoop Vendor>-tinted goggles?
FUD is still being 'offered' by some Hadoop
vendors. Claims need to be contextualized.
4. Analysts: thorough understanding of the
market dynamics.
14
II Big Data Typical Big Data
Stack Hadoop Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1 Big Data
• Big Data is still one of the most inflated buzzwords of
the last years.
• Big Data is a broad term for data sets so large or
complex that traditional data processing tools are
inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above
definition is inadequate.
• "Big Data refers to datasets and flows large enough
that have outpaced our capability to store, process,
analyze, and understand." Amir H. Payberah,
Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data
stack.
• Hadoop ecosystem = Hadoop stack + many other tools
(either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name
An incomplete but useful list of Big Data related projects
packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr
Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the
video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture
representing the evolution of Apache Hadoop:
https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing and more:
 • BYOS: Bring Your Own Storage
 • BYOC: Bring Your Own Cluster
 • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
 • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
 • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
 • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
 • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
19
5 Key Takeaways
1. Big Data: still one of the most inflated
buzzwords.
2. Typical Big Data Stack: Big Data stacks look
similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer
'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache
Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big
Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi: a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding: a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007,
MapReduce v1 was the only choice as a compute model
(execution engine) on Hadoop. Now we have, in addition
to MapReduce v2: Tez, Spark, and Flink.
23
• MapReduce (1st generation): Batch
• Tez (2nd generation): Batch, Interactive
• Spark (3rd generation): Batch, Interactive, Near-Real-Time
• Flink (4th generation): Batch, Interactive, Real-Time, Iterative
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch; scalability; abstractions (see slide on the evolution of programming APIs); User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics.
1 Evolution: Apache Tez
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."
Source: http://tez.apache.org
• Apache Tez is an extensible framework for
building high-performance batch and
interactive data processing applications, coordinated by YARN in Apache Hadoop.
1 Evolution: Apache Spark
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
 • Batch and streaming in the same system
 • Beyond DAGs (cyclic operator graphs)
 • Powerful, expressive APIs
 • Inside-the-system iterations
 • Full Hadoop compatibility
 • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark
Criteria     MapReduce                Tez                       Spark
License      Open Source Apache 2.0,  Open Source Apache 2.0,   Open Source Apache 2.0,
             version 2.x              version 0.x               version 1.x
Processing   On-Disk (disk-based      On-Disk; Batch,           In-Memory, On-Disk; Batch,
Model        parallelization); Batch  Interactive               Interactive, Streaming
                                                                (Near Real-Time)
Written in   Java                     Java                      Scala
API          [Java, Python, Scala],   Java [ISV/Engine/Tool     [Scala, Java, Python],
             User-Facing              builder]                  User-Facing
Libraries    None (separate tools)    None                      [Spark Core, Spark Streaming,
                                                                Spark SQL, MLlib, GraphX]
Hadoop MapReduce vs. Tez vs. Spark
Criteria       MapReduce                Tez                      Spark
Installation   Bound to Hadoop          Bound to Hadoop          Isn't bound to Hadoop
Ease of Use    Difficult to program,    Difficult to program;    Easy to program,
               needs abstractions;      no interactive mode      no need of abstractions;
               no interactive mode      except Hive, Pig         interactive mode
               except Hive, Pig
Compatibility  to data types and data   same                     same
               sources is same
YARN           YARN application         Ground-up YARN           Spark is moving
integration                             application              towards YARN
Hadoop MapReduce vs. Tez vs. Spark
Criteria      MapReduce            Tez                  Spark
Deployment    YARN                 YARN                 [Standalone, YARN (partial
                                                        support), SIMR, Mesos, ...]
Performance                                             - Good performance when data
                                                          fits into memory
                                                        - Performance degradation
                                                          otherwise
Security      More features and    More features and    Still in its infancy
              projects             projects
30
Partial support
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can
migrate to Spark and leverage Spark Core as the
execution engine:
1. You can often reuse your mapper and
reducer functions and just call them in
Spark from Java or Scala.
2. You can translate your code from
MapReduce to Apache Spark. How-to:
Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
32
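A sketch of what such a translation looks like: the classic word-count mapper/reducer pair collapses into a couple of RDD transformations (paths are placeholders; see the Cloudera how-to above for the full walkthrough):

```scala
// MapReduce: the Mapper emits (word, 1); the Reducer sums values per key.
// Spark: the same logic, expressed as transformations on an RDD
val input = sc.textFile("hdfs:///data/input")   // placeholder path

val mapped  = input.flatMap(_.split("\\s+")).map(w => (w, 1)) // "map" phase
val reduced = mapped.reduceByKey(_ + _)                       // "reduce" phase

reduced.saveAsTextFile("hdfs:///data/output")
```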
2 Transition
3. The following tools, originally based on Hadoop
MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration
without development effort
• Speed up your existing Pig scripts on Spark (Query,
Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as
Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (status: passed end-to-end test
cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality
through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella Jira (status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on Spark execution engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
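The "any two data sources" idea behind Sqoop2 can be sketched as a pluggable reader wired to a pluggable writer, with the execution engine chosen separately. An illustrative Python sketch, not Sqoop's actual API:

```python
# Hedged sketch: a transfer job is just a source connector streaming records
# into a sink connector; swapping either side changes the data source.
def transfer(read_records, write_record):
    count = 0
    for record in read_records():  # pull from the source connector
        write_record(record)       # push to the sink connector
        count += 1
    return count

# Example: "RDBMS" rows copied into an in-memory stand-in for an HDFS sink.
source_rows = [(1, "alice"), (2, "bob")]
sink = []
moved = transfer(lambda: iter(source_rows), sink.append)
print(moved)
```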
37
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 ships with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration: Service → Open Source Tool
[Diagram: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
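The storage-agnostic read path can be illustrated: Spark delegates I/O to the Hadoop FileSystem API, which selects a backend from the URI scheme, so the same job can read HDFS, S3 or local files. A simplified Python stand-in (the mapping and function name are illustrative, not Spark internals):

```python
from urllib.parse import urlparse

# Illustrative scheme -> backend mapping, mirroring how the Hadoop FileSystem
# API dispatches on the URI scheme of the path you pass to sc.textFile(...).
BACKENDS = {
    "hdfs": "HDFS",
    "s3n": "Amazon S3",
    "file": "local file system",
}

def backend_for(path):
    scheme = urlparse(path).scheme
    return BACKENDS.get(scheme, "other Hadoop-API-compatible storage")

# The same sc.textFile(...) call works unchanged for any of these paths:
for p in ["hdfs://namenode:8020/logs", "s3n://bucket/logs", "file:///tmp/logs"]:
    print(p, "->", backend_for(p))
```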
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark (status: still in experimentation, no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. Also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly supported by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files
48
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop and Spark:
- Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
- Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
- Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving; some of the open issues are critical ones (see the open SPARK issues mentioning YARN on the Apache Jira: https://issues.apache.org/jira)
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for the machine learning algorithms in MLlib
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
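Schema inference over JSON records can be sketched in a few lines of plain Python; this is conceptually what Spark SQL does at much larger scale (the function name and records are illustrative):

```python
import json

# Minimal sketch of JSON schema inference: scan records (one JSON object per
# line), take the union of all field names, and record each field's observed
# Python types. Spark SQL additionally maps types onto its SQL type system.
def infer_schema(lines):
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

records = ['{"name": "Alice", "age": 34}', '{"name": "Bob", "city": "LA"}']
print(infer_schema(records))
```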
56
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
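The benefit of a columnar layout can be shown with a toy pivot in plain Python: once rows are stored column-wise, a query touching one column reads only that column's data. A simplified sketch of the idea, not the Parquet format itself:

```python
# Pivot row-oriented records into per-column arrays, the core idea behind
# columnar formats like Parquet (which add encoding, compression, and
# per-column statistics on top of this layout).
def to_columns(rows):
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

rows = [{"user": "a", "bytes": 10}, {"user": "b", "bytes": 32}]
cols = to_columns(rows)
print(sum(cols["bytes"]))  # an aggregate scans only the "bytes" column
```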
57
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
- Various inbound data sets
- Data layout can change without notice
- New data sets can be added without notice
• Result:
- Leverage Spark to dynamically split the data
- Leverage Avro to store the data in a compact binary format
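The "dynamically split the data" step can be sketched in plain Python: records whose layout may change are bucketed by their set of field names, so new data sets need no code changes (field names below are illustrative, not from the cited use case):

```python
# Bucket heterogeneous records by their "schema" (the set of field names).
# Each bucket could then be written out as its own Avro file with its own
# schema; new layouts simply create new buckets.
def split_by_schema(records):
    buckets = {}
    for rec in records:
        key = tuple(sorted(rec))  # field names identify the layout
        buckets.setdefault(key, []).append(rec)
    return buckets

records = [
    {"id": 1, "name": "a"},
    {"id": 2, "ip": "10.0.0.1"},
    {"id": 3, "name": "b"},
]
print({k: len(v) for k, v in split_by_schema(records).items()})
```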
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
- Migrate ingestion of HDFS data into Solr from MapReduce to Spark
- Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity
References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
- Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
- MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
- https://spark.apache.org/docs/latest/storage-openstack-swift.html
- https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Hadoop Ecosystem → Spark Ecosystem
Components:
- HDFS → Tachyon
- YARN → Mesos
Tools:
- Pig → Spark native API
- Hive → Spark SQL
- Mahout → MLlib
- Storm → Spark Streaming
- Giraph → GraphX
- HUE → Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as data center "OS":
- Share a datacenter between multiple cluster computing apps; provide new abstractions and services
- Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos
Criteria:
- Resource sharing: YARN: Yes; Mesos: Yes
- Written in: YARN: Java; Mesos: C++
- Scheduling: YARN: Memory only; Mesos: CPU and memory
- Running tasks: YARN: Unix processes; Mesos: Linux container groups
- Requests: YARN: Specific requests and locality preference; Mesos: More generic, but more coding for writing frameworks
- Maturity: YARN: Less mature; Mesos: Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
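The conciseness point can be seen with plain lambdas: the chained functional style below is essentially the shape the Scala, Java 8 and Python Spark APIs expose over RDDs (the log lines are illustrative):

```python
# Filter then transform with lambdas; in Spark this would be
#   sc.textFile(path).filter(...).map(...).collect()
lines = ["ERROR disk full", "INFO ok", "ERROR timeout"]
errors = list(
    map(lambda s: s.split()[1],              # keep the word after the level
        filter(lambda s: s.startswith("ERROR"), lines))
)
print(errors)
```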
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
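The "mix SQL with imperative code" idea can be sketched with sqlite3 so it runs locally; Spark SQL generalizes this pattern to distributed data (the table and values are illustrative):

```python
import sqlite3

# Declarative step: aggregate with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10), ("b", 32), ("a", 5)])
rows = conn.execute(
    "SELECT user, SUM(bytes) FROM events GROUP BY user").fetchall()

# Imperative step: keep transforming the result with ordinary code,
# the way Spark SQL results feed into further RDD/DataFrame operations.
top = max(rows, key=lambda r: r[1])
print(top)
```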
92
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming
Criteria:
- Processing model: Storm: Record at a time; Spark Streaming: Mini-batches
- Latency: Storm: Sub-second; Spark Streaming: Few seconds
- Fault tolerance (every record processed): Storm: At least once (may be duplicates); Spark Streaming: Exactly once
- Batch framework integration: Storm: Not available; Spark Streaming: Core Spark API
- Supported languages: Storm: Any programming language; Spark Streaming: Scala, Java, Python
95
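The mini-batch model in the comparison above can be sketched in plain Python: the input stream is cut into small batches, and each batch is then processed with ordinary batch operations. A simplified sketch of Spark Streaming's DStream model, not its actual API:

```python
# Cut a stream into fixed-size mini-batches, the core of the mini-batch model
# (Spark Streaming batches by time interval rather than by count).
def micro_batches(stream, batch_size):
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

# Per-batch processing is just normal batch code, e.g. counting clicks:
events = ["click", "view", "click", "view", "click"]
counts = [sum(1 for e in b if e == "click") for b in micro_batches(events, 2)]
print(counts)
```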
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Vendors
• "Spark is already an excellent piece of software and is advancing very quickly. No vendor - no new project - is likely to catch up. Chasing Spark would be a waste of time and would delay availability of real-time analytic and processing services for no good reason." Source: MapReduce and Spark, December 30, 2013: http://vision.cloudera.com/mapreduce-spark
• "Apache Spark is an open source, parallel data processing framework that complements Apache Hadoop to make it easy to develop fast, unified Big Data applications combining batch, streaming, and interactive analytics on all your data." http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html
9
3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
10
3. Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
11
4. Analysts
• Is Apache Spark replacing Hadoop, or complementing
existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop?", there is no
reason to think solution stacks built on Spark, not
positioned as Hadoop, will not continue to proliferate
as the technology matures.
• At the same time, Hadoop distributions are all
embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum,
February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
12
4. Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014.
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
13
5. Key Takeaways
1. News: Big Data is no longer a Hadoop
monopoly.
2. Surveys: Listen to what Spark developers are
saying.
3. Vendors: <Hadoop Vendor>-tinted goggles?
FUD is still being 'offered' by some Hadoop
vendors! Claims need to be contextualized.
4. Analysts: Thorough understanding of the
market dynamics.
14
II. Big Data, Typical Big Data
Stack, Hadoop, Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1. Big Data
• Big Data is still one of the most inflated buzzwords of
recent years.
• Big Data is a broad term for data sets so large or
complex that traditional data processing tools are
inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above
definition is inadequate!
• "Big Data refers to datasets and flows large enough
that they have outpaced our capability to store, process,
analyze, and understand." Amir H. Payberah,
Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3. Apache Hadoop
• Apache Hadoop as an example of a typical Big Data
stack.
• Hadoop ecosystem = Hadoop Stack + many other tools
(either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name
Incomplete, but a useful list of Big Data related projects
packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr
Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch
video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture
representing the evolution of Apache Hadoop.
https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing and more:
• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots from BDAS (Berkeley Data Analytics Stack) and new tools from both the open source community and commercial ones. I'm compiling a list. Stay tuned!
19
5. Key Takeaways
1. Big Data: Still one of the most inflated
buzzwords.
2. Typical Big Data Stack: Big Data stacks look
similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer
'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache
Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big
Data. http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007,
MapReduce v1 was the only choice as a compute model
(execution engine) on Hadoop. Now we have, in addition
to MapReduce v2: Tez, Spark and Flink.
23
• 1st Generation: Batch
• 2nd Generation: Batch, Interactive
• 3rd Generation: Batch, Interactive, Near-Real Time
• 4th Generation: Batch, Interactive, Real-Time, Iterative
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see slide on evolution of Programming APIs). User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning and Graph Analytics.
24
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."
Source: http://tez.apache.org
• Apache Tez is an extensible framework for
building high performance batch and
interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1. Evolution
• 'Spark' for lightning fast speed.
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing Model | On-Disk (disk-based parallelization); Batch | On-Disk; Batch, Interactive | In-Memory, On-Disk; Batch, Interactive, Streaming (Near Real-Time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala] User-Facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python] User-Facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of Use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | Compatibility to data types and data sources is the same | Same | Same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, ...]
Performance | - | - | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy (partial support)
30
III. Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2. Transition
• Existing Hadoop MapReduce projects can
migrate to Spark and leverage Spark Core as
execution engine:
1. You can often reuse your mapper and
reducer functions and just call them in
Spark from Java or Scala.
2. You can translate your code from
MapReduce to Apache Spark: How-to:
Translate from MapReduce to Apache Spark. http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
32
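To make the translation concrete, here is a hedged sketch (assuming a Spark installation; the input and output paths are hypothetical) of the classic MapReduce WordCount re-expressed with Spark's Scala API, where the mapper becomes flatMap/map and the reducer becomes reduceByKey:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCountSketch"))
    val counts = sc.textFile("input.txt")      // hypothetical input path
      .flatMap(line => line.split("\\s+"))     // map phase: emit words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                      // reduce phase: sum the counts per word
    counts.saveAsTextFile("counts")            // hypothetical output path
    sc.stop()
  }
}
```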
2. Transition
3. The following tools, originally based on Hadoop
MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration
without development effort.
• Speed up your existing Pig scripts on Spark (Query,
Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as
Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: Passed end-to-end test
cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality
through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in Beta,
expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on
MapReduce or Tez easily migrate to Spark, without
development effort.
• Exposes Spark users to a viable, feature-rich, de facto
standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries
involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in Beta,
expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho,
Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and
Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark
(expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially
developed as a tool to transfer data from RDBMS to
Hadoop.
• The next version of Sqoop, referred to as Sqoop2,
supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under
discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira
Status: Work In Progress). The goal of this ticket is to support a
pluggable way to select the execution engine on which we can run
the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application
development platform for building data applications on
Hadoop.
• Support for Apache Spark is on the roadmap and will be
available in the Cascading 3.1 release.
Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the
transition from Cascading/Scalding to Spark a little
easier, by adding support for Cascading Taps, Scalding
Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a
framework for writing, testing and running
MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a
SparkPipeline class, making it easy to migrate
data processing applications from MapReduce
to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce!
Apache Mahout, the original Machine Learning (ML)
library for Hadoop since 2009, is rejecting new
MapReduce algorithm
implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed
Machine Learning on Spark. Programs written in this
DSL are automatically optimized and executed in
parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the
Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov,
April 2014:
http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with
Mahout, Scala and Spark, published on May 30, 2014:
http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased):
MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3. Integration: Service | Open Source Tool
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
44
3. Integration
• Out of the box, Spark can interface with HBase, as it has
full support for Hadoop InputFormats via
newAPIHadoopRDD. Example: HBaseTest.scala from the
Spark code. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available
for reading from and writing to HBase without the need
of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with
Spark. Status: still in experimentation, and no timetable for
possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
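Along the lines of the HBaseTest.scala example above, a hedged sketch of reading an HBase table via newAPIHadoopRDD (requires Spark and the HBase client libraries on the classpath; the table name is illustrative):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseReadSketch"))
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table name
    // Each record is a (row key, Result) pair coming straight from HBase.
    val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println(s"Row count: ${rdd.count()}")
    sc.stop()
  }
}
```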
3. Integration
• Spark Cassandra Connector: this library lets you
expose Cassandra tables as Spark RDDs, write Spark
RDDs to Cassandra tables, and execute arbitrary CQL
queries in your Spark applications. It also supports
integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration
is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
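A hedged sketch of the Spark Cassandra Connector in use (assumes the spark-cassandra-connector dependency and a reachable Cassandra node; the keyspace, table, and host are illustrative):

```scala
import com.datastax.spark.connector._  // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

object CassandraSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraSketch")
      .set("spark.cassandra.connection.host", "127.0.0.1") // cluster address is an assumption
    val sc = new SparkContext(conf)
    // Expose a Cassandra table as a Spark RDD ("ks" and "users" are hypothetical names).
    val users = sc.cassandraTable("ks", "users")
    println(s"Users: ${users.count()}")
    // Write a Spark RDD back to a Cassandra table with matching columns.
    sc.parallelize(Seq(("u1", "Alice")))
      .saveToCassandra("ks", "users", SomeColumns("id", "name"))
    sc.stop()
  }
}
```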
3. Integration
• Benchmark of Spark and Cassandra integration
using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume
data from Cassandra to Spark and store Resilient
Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new
avenues.
• Kindling: An Introduction to Spark with Cassandra
(Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3. Integration
• MongoDB is not directly served by Spark, although
it can be used from Spark via the official Mongo-
Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its
support for reading and writing JSON text files.
48
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector,
for reading and writing MongoDB collections directly from
Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without
Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph
database.
• Getting Started with Apache Spark and Neo4j Using
Docker Compose, by Kenny Bastani, March 10, 2015:
http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015:
http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph
Analytics, by Kenny Bastani, November 3, 2014:
http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit
reference to Mesos as the Resource Negotiator).
• Integration still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883. https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
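The Hive bullets above can be sketched in code. A hedged example using the Spark 1.2-era HiveContext (assumes Spark built with Hive support and a configured metastore; the table names are hypothetical):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveSketch"))
    val hiveCtx = new HiveContext(sc) // picks up hive-site.xml for the metastore
    // Query an existing Hive table ("logs" is a hypothetical table name)...
    val errors = hiveCtx.sql("SELECT * FROM logs WHERE level = 'ERROR'")
    println(s"Error rows: ${errors.count()}")
    // ...and write the result back out to a new Hive table.
    hiveCtx.sql(
      "CREATE TABLE IF NOT EXISTS error_logs AS SELECT * FROM logs WHERE level = 'ERROR'")
    sc.stop()
  }
}
```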
3. Integration
• Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to
address new use cases:
• Use a Drill query (or view) as the input to Spark. Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark.
• Use Drill to query Spark RDDs. Use BI tools to query
in-memory data in Spark. Embed Drill execution in a
Spark data pipeline.
Source: What's Coming in 2015 for
Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3. Integration
• Apache Kafka is a high-throughput distributed
messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka:
Spark Streaming + Kafka Integration Guide. http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming:
Code Examples and State of the Game. http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
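A hedged sketch of the native Kafka integration (receiver-based API from the integration guide above; requires the spark-streaming-kafka artifact, and the ZooKeeper quorum, consumer group and topic map are assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("KafkaStreamSketch"), Seconds(10))
    // Receiver-based stream of (key, message) pairs; keep only the message payload.
    val lines = KafkaUtils.createStream(ssc,
      "zk-host:2181", "sketch-group", Map("events" -> 1)).map(_._2)
    lines.count().print() // number of events per 10-second batch
    ssc.start()
    ssc.awaitTermination()
  }
}
```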
3. Integration
• Apache Flume is a streaming event data
ingestion system that is designed for the Big Data
ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with
Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based
Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that
is vastly simplifying the end-to-end experience of
working with JSON data.
• Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD. No more DDL. Just point Spark
SQL to JSON files and query. Starting with Spark 1.3,
SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
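The "no more DDL" workflow above can be sketched as follows (hedged: assumes Spark 1.2, where jsonFile returns a SchemaRDD; the file name and fields are hypothetical, with one JSON object per line):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonSketch"))
    val sqlCtx = new SQLContext(sc)
    // Schema is inferred automatically from the JSON records, no DDL needed.
    val people = sqlCtx.jsonFile("people.json") // hypothetical input path
    people.printSchema()
    people.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age > 21")
      .collect().foreach(println)
    sc.stop()
  }
}
```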
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files. http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
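The three Parquet bullets above, sketched with the Spark 1.2-era API (hedged: the case class, data, and path are illustrative; createSchemaRDD is the pre-DataFrame implicit conversion):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object ParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetSketch"))
    val sqlCtx = new SQLContext(sc)
    import sqlCtx.createSchemaRDD // implicit RDD[Person] -> SchemaRDD (Spark 1.2 era)

    val people = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 31)))
    people.saveAsParquetFile("people.parquet")        // write an RDD out as Parquet
    val loaded = sqlCtx.parquetFile("people.parquet") // schema travels with the file
    loaded.registerTempTable("people")
    sqlCtx.sql("SELECT name FROM people WHERE age > 30")
      .collect().foreach(println)
    sc.stop()
  }
}
```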
3. Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to
work with datasets on Hadoop, hiding many of
the details of compression codecs, file formats,
partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16
release, so Spark jobs can read and write to Kite
datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics
engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark, in the form of an RDD that can
read data from Elasticsearch. Also, any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for
fast and easy indexing, ingestion and serving of
searchable complex data: "CrunchIndexerTool on
Spark".
• Solr-on-Spark solution using Apache Solr, Spark,
Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem
can work together, each for what it is especially good at,
rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4. Complementarity: + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity: YARN + Mesos
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN
cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache
Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get
Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution, statistics, or... HDFS caching).
• The Spark execution layer could be leveraged without the
need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with
YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity: Spark + Tez
• Data >> RAM: When processing huge data volumes,
much bigger than cluster RAM, Tez might be better,
since it is more "stream oriented", has a more mature
shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed
data in memory, it can be much better when we process
data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native
YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer:
a smart execution engine dynamically selects the optimal
compute framework at each step in the big data
analytics process, based on the type of platform, the
attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on
November 13, 2014, with Matt Schumpert, Director of Product
Management at Datameer.
• The Challenge to Choosing the "Right" Execution
Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by
Erik Halseth of Datameer, on January 27th, 2015, at the Los Angeles
Big Data Users Group:
http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption, February 12, 2015:
http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms,
February 23, 2015:
http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015:
http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing.
Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: A healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all!
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
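The deployment choices above differ only in the master URL passed to Spark; the application itself is unchanged. A hypothetical sketch of the spark-submit invocations (jar name and hosts are placeholders):

```shell
# Same application jar, different cluster managers -- only --master changes.
spark-submit --master local[4]          my-app.jar   # 1. Local (4 threads)
spark-submit --master spark://host:7077 my-app.jar   # 2. Standalone cluster
spark-submit --master mesos://host:5050 my-app.jar   # 3. Apache Mesos
# With Hadoop, the same jar runs on YARN: --master yarn-cluster
```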
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as the complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Hadoop Ecosystem        Spark Ecosystem
Components:
HDFS                    Tachyon
YARN                    Mesos
Tools:
Pig                     Spark native API
Hive                    Spark SQL
Mahout                  MLlib
Storm                   Spark Streaming
Giraph                  GraphX
HUE                     Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria          YARN                             Mesos
Resource sharing  Yes                              Yes
Written in        Java                             C++
Scheduling        Memory only                      CPU and memory
Running tasks     Unix processes                   Linux container groups
Requests          Specific requests and            More generic, but more coding
                  locality preference              to write frameworks
Maturity          Less mature                      Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
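To illustrate how concise the native API is, here is the classic word count in Scala: a few lines versus the Java MapReduce version referenced earlier in the deck. It is a sketch for the interactive shell (spark-shell), where `sc` is predefined; the input path is a placeholder.

```scala
// Word count in the Scala API (runs in spark-shell; input path is hypothetical).
val counts = sc.textFile("/data/input.txt")        // any supported file system
  .flatMap(line => line.split(" "))                // one record per word
  .map(word => (word, 1))                          // pair each word with 1
  .reduceByKey(_ + _)                              // sum counts per word

counts.take(10).foreach(println)
```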
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
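A minimal sketch of that "mix and match" style with the Spark 1.2-era API (SchemaRDD): table, file, and column names are hypothetical, and a spark-shell session with `sc` is assumed.

```scala
// Sketch: declarative SQL mixed with RDD-style transformations.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("people.json")    // schema inferred, no DDL
people.registerTempTable("people")

// SQL query ...
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")

// ... whose result is an RDD you can keep transforming imperatively
adults.map(row => row.getString(0).toUpperCase).collect().foreach(println)
```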
92
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                     Storm                      Spark Streaming
Processing model             Record at a time           Mini-batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once (may be      Exactly once
record processed)            duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python
95
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Vendors
• "Apache Spark is a general-purpose engine for large-scale data processing. Spark supports rapid application development for big data and allows for code reuse across batch, interactive, and streaming applications. Spark also provides advanced execution graphs with in-memory pipelining to speed up end-to-end application performance." https://www.mapr.com/products/apache-spark
• MapR Adds Complete Apache Spark Stack to its Distribution for Hadoop: https://www.mapr.com/company/press-releases/mapr-adds-complete-apache-spark-stack-its-distribution-hadoop
10
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks we love Spark and want to help our customers leverage all its benefits." October 30, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
11
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing existing Hadoop practice?
• Both are already happening.
• With uncertainty about "what is Hadoop?", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures.
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum, February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
12
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
13
5 Key Takeaways
1. News: Big Data is no longer a Hadoop monopoly.
2. Surveys: Listen to what Spark developers are saying.
3. Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the market dynamics.
14
II Big Data Typical Big Data
Stack Hadoop Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that [they have] outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned.
19
5 Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink.
23
1st generation: Batch
2nd generation: Batch, Interactive
3rd generation: Batch, Interactive, Near-real-time
4th generation: Batch, Interactive, Real-time, Iterative
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see the slide on the evolution of programming APIs). User-defined functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
1 Evolution Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark
Criteria             MapReduce               Tez                    Spark
License              Open source,            Open source,           Open source,
                     Apache 2.0, version 2.x Apache 2.0,            Apache 2.0, version 1.x
                                             version 0.x
Processing model     On-disk (disk-based     On-disk; batch,        In-memory and on-disk;
                     parallelization);       interactive            batch, interactive,
                     batch                                          streaming (near real-time)
Language written in  Java                    Java                   Scala
API                  [Java, Python, Scala];  Java [ISV/engine/      [Scala, Java, Python];
                     user-facing             tool builder]          user-facing
Libraries            None; separate tools    None                   [Spark Core, Spark Streaming,
                                                                    Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria          MapReduce                  Tez                    Spark
Installation      Bound to Hadoop            Bound to Hadoop        Isn't bound to Hadoop
Ease of use       Difficult to program,      Difficult to program;  Easy to program, no need
                  needs abstractions; no     no interactive mode    for abstractions;
                  interactive mode except    except Hive, Pig       interactive mode
                  Hive, Pig
Compatibility     Same for data types        Same for data types    Same for data types
                  and data sources           and data sources       and data sources
YARN integration  YARN application           Ground-up YARN         Spark is moving
                                             application            towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria     MapReduce               Tez                    Spark
Deployment   YARN                    YARN                   [Standalone, YARN,
                                                            SIMR, Mesos, …]
Performance  -                       -                      Good performance when data
                                                            fits into memory; performance
                                                            degradation otherwise
Security     More features and       More features and      Still in its infancy
             projects                projects               (partial support)
30
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
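A sketch of the translation described above: a MapReduce mapper/reducer pair collapses into a few RDD transformations. The log-counting job, paths, and field layout are hypothetical; a Spark runtime is assumed.

```scala
// Hypothetical MapReduce job rewritten in Spark.
val events = sc.textFile("hdfs:///logs")

// Mapper equivalent: emit (key, 1) per record
val keyed = events.map(line => (line.split("\t")(0), 1))

// Reducer equivalent: sum per key -- reduceByKey also performs the
// map-side combine that a MapReduce Combiner would.
val totals = keyed.reduceByKey(_ + _)

totals.saveAsTextFile("hdfs:///logs-summary")
```

Existing Java mapper and reducer logic can often be invoked from inside these closures unchanged.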
32
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
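The "-x spark" migration path above boils down to a one-flag change on the command line. A hypothetical invocation (script name is a placeholder; requires a Pig build with the Spark engine):

```shell
# Same Pig script, different execution engines -- no script changes needed.
pig -x mapreduce wordcount.pig   # today: runs on MapReduce
pig -x spark     wordcount.pig   # after migration: runs on Spark
```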
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
(Expected in the Cascading 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• The Apache Crunch 0.11 release ships with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
(Diagram: services in the Big Data stack and the open source tools Spark integrates with - storage/serving layer, data formats, data ingestion services, resource management, search, SQL)
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
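A minimal sketch of the newAPIHadoopRDD route mentioned above, in the spirit of the linked HBaseTest.scala. The table name is hypothetical; it assumes a spark-shell session with HBase client jars and hbase-site.xml on the classpath.

```scala
// Sketch: reading an HBase table through the standard Hadoop InputFormat.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table")   // hypothetical table

val rows = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println(s"Row count: ${rows.count()}")
```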
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
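A sketch of the DataStax connector usage described above: keyspace, tables, and data are hypothetical, and it assumes the connector jar on the classpath with spark.cassandra.connection.host set in the SparkConf.

```scala
// Sketch: Cassandra tables as RDDs via the spark-cassandra-connector.
import com.datastax.spark.connector._

// Expose a Cassandra table as an RDD ...
val users = sc.cassandraTable("my_keyspace", "users")
println(users.first)

// ... and write a plain RDD back to another table.
val scores = sc.parallelize(Seq(("alice", 10), ("bob", 7)))
scores.saveToCassandra("my_keyspace", "scores", SomeColumns("name", "score"))
```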
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra in Spark and store resilient distributed datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID), native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open Spark/YARN issues in JIRA: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark. Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs. Use BI tools to query in-memory data in Spark. Embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
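A minimal sketch of the native Kafka receiver from the integration guide above (Spark 1.x API). The ZooKeeper quorum, consumer group, and topic name are hypothetical; it assumes the spark-streaming-kafka artifact on the classpath.

```scala
// Sketch: counting Kafka messages per 2-second batch with Spark Streaming.
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(2))
// (topic -> number of receiver threads)
val stream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-consumer-group", Map("events" -> 1))

stream.map(_._2)   // (key, message) pairs -> message payload
      .count()
      .print()     // messages per batch

ssc.start()
ssc.awaitTermination()
```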
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
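A sketch of the schema-inference flow described above (Spark 1.2-era SQLContext; the file name and fields are hypothetical, and a spark-shell session is assumed):

```scala
// Sketch: query JSON with no DDL -- the schema is inferred from the documents.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val tweets = sqlContext.jsonFile("tweets.json")   // hypothetical file
tweets.printSchema()                               // inferred, nested schema
tweets.registerTempTable("tweets")
sqlContext.sql("SELECT user.name, text FROM tweets")
          .collect().foreach(println)
```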
56
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
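Why a columnar format helps can be sketched in plain Python (a toy illustration, not the Parquet format itself): when data is stored by column, a query touches only the columns it needs instead of scanning whole rows.

```python
# Toy column store: rows are pivoted into per-column lists, so an
# aggregation over one column never reads the others (the core Parquet idea).
rows = [
    {"user": "alice", "clicks": 3, "country": "US"},
    {"user": "bob",   "clicks": 7, "country": "DE"},
]

def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    return {key: [row[key] for row in rows] for key in rows[0]}

columns = to_columns(rows)
total_clicks = sum(columns["clicks"])  # reads only the "clicks" column
print(total_clicks)
```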
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to
work with datasets on Hadoop, hiding many of
the details of compression codecs, file formats,
partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16
release, so Spark jobs can read and write to Kite
datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics
engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark, in the form of an RDD that can
read data from Elasticsearch. Also, any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for
fast and easy indexing, ingestion, and serving of
searchable complex data: "CrunchIndexerTool on
Spark"
• Solr-on-Spark solution using Apache Solr, Spark,
Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem
can work together, each for what it is especially good at,
rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN
cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache
Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get
Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the
need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with
YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity +
• Data >> RAM: For processing huge data volumes,
much bigger than cluster RAM, Tez might be better,
since it is more "stream oriented," has a more mature
shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data
in memory, it can be much better when we process
data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native
YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer:
a smart execution engine dynamically selects the optimal
compute framework at each step in the big data
analytics process, based on the type of platform, the
attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on
November 13, 2014 with Matt Schumpert, Director of Product
Management at Datameer.
• The Challenge to Choosing the "Right" Execution
Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by
Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles
Big Data Users Group:
http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption, February 12, 2015:
http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms,
February 23, 2015:
http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015:
http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5 Key Takeaways
1. Evolution: The evolution of compute models is still ongoing.
Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS," July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
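This cluster-agnosticism is visible in Spark's "master" URL: the same application can target any of the deployment modes above just by changing that one string. A hedged sketch of the dispatch idea in plain Python (the real logic lives inside Spark; the function below is illustrative only, using the Spark 1.x era URL forms):

```python
def classify_master(master_url):
    """Map a Spark-style master URL to a deployment mode (illustrative only)."""
    if master_url.startswith("local"):
        return "local"                  # e.g. local, local[4], local[*]
    if master_url.startswith("spark://"):
        return "standalone"             # Spark's own standalone cluster manager
    if master_url.startswith("mesos://"):
        return "mesos"
    if master_url in ("yarn-client", "yarn-cluster"):
        return "yarn"                   # Spark 1.x YARN modes
    raise ValueError("unknown master URL: " + master_url)

print(classify_master("local[*]"))
print(classify_master("spark://host:7077"))
```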
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on
Hadoop. It gets its data from Amazon's S3
(most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and
data products in an instant, March 4, 2015:
https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at
Spark Summit 2014, July 2, 2014:
https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra,
presents itself as a non-Hadoop Big Data platform.
Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with
Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014:
http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector,
Helena Edelson, published November 24, 2014:
http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big
data analytics platform, available with a novel
architecture that integrates components across
three logical layers: Infrastructure, Analytics,
and Applications.
• xPatterns is cloud-based, exceedingly scalable,
and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments.
• With EPIC software, you can spin up Hadoop
clusters – with the data and analytical tools that
your data scientists need – in minutes rather than
months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into
its Operational Intelligence Platform, deployed at the
world's largest telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes
streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially
compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives

Hadoop ecosystem | Spark ecosystem
Component:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory
speed across cluster frameworks such as Spark
and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible. Existing Spark
and MapReduce programs can run on top of it
without any code change.
• Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained
sharing, which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution. This leads to considerable performance
improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing
apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including
Apache Spark, Apache Cassandra, Apache YARN,
Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more
concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014:
http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
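The conciseness of that API comes from chaining functional transformations. A toy, pure-Python imitation of the flatMap/map/reduceByKey style (not Spark itself, and eager rather than lazy) shows the shape:

```python
class ToyRDD:
    """A minimal, eager imitation of Spark's chained RDD operations."""
    def __init__(self, data):
        self.data = list(data)

    def flat_map(self, fn):
        return ToyRDD(x for item in self.data for x in fn(item))

    def map(self, fn):
        return ToyRDD(fn(x) for x in self.data)

    def reduce_by_key(self, fn):
        acc = {}
        for key, value in self.data:
            acc[key] = fn(acc[key], value) if key in acc else value
        return ToyRDD(acc.items())

# Word count in the classic Spark chaining style.
lines = ToyRDD(["to be or", "not to be"])
counts = (lines.flat_map(str.split)
               .map(lambda word: (word, 1))
               .reduce_by_key(lambda a, b: a + b))
print(dict(counts.data))
```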
Spark SQL
• Spark SQL is a new SQL engine designed from the
ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains
compatibility with Hive. It supports all existing Hive data
formats, user-defined functions (UDFs), and the Hive
metastore.
• Spark SQL also allows manipulating (semi-)structured
data, as well as ingesting data from sources that
provide a schema, such as JSON, Parquet, Hive, or
EDWs. It unifies SQL and sophisticated analysis,
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics.
92
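The "mix and match SQL with imperative code" idea can be illustrated with Python's built-in sqlite3 (an analogy only; Spark SQL's APIs are different): run a declarative SQL query, then post-process the result with ordinary program logic.

```python
import sqlite3

# In-memory table standing in for a schema-bearing source (JSON, Parquet, Hive...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)])

# Declarative part: SQL aggregates per user.
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user").fetchall()

# Imperative part: arbitrary program logic over the SQL result.
big_spenders = [user for user, total in rows if total > 6]
print(big_spenders)
```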
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
95
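The "mini batches" row can be made concrete with a toy sketch in plain Python (illustrative only): Spark Streaming slices the incoming stream into small batches and processes each one with the normal batch API, which is why its latency is a few seconds rather than sub-second.

```python
def micro_batches(records, batch_size):
    """Yield fixed-size mini batches, standing in for time-sliced intervals."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

stream = ["click", "view", "click", "click", "view", "click"]

# Each mini batch is processed with ordinary batch logic (here: counting clicks).
per_batch_clicks = [batch.count("click") for batch in micro_batches(stream, 3)]
print(per_batch_clicks)
```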
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based
notebook that enables interactive data analytics.
It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based
editor that can combine Scala code, SQL
queries, Markup, or even JavaScript in a
collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for
IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Vendors
• "Apache Spark provides an elegant, attractive development API and allows data workers to rapidly iterate over data via machine learning and other data science techniques that require fast, in-memory data processing." http://hortonworks.com/hadoop/spark/
• Hortonworks: A shared vision for Apache Spark on Hadoop, October 21, 2014: https://databricks.com/blog/2014/10/31/hortonworks-a-shared-vision-for-apache-spark-on-hadoop.html
• "At Hortonworks, we love Spark and want to help our customers leverage all its benefits." October 30th, 2014: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
11
4 Analysts
• Is Apache Spark replacing Hadoop, or complementing
existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop," there is no
reason to think solution stacks built on Spark, not
positioned as Hadoop, will not continue to proliferate
as the technology matures.
• At the same time, Hadoop distributions are all
embracing Spark and including it in their offerings.
Source: Hadoop Questions from Recent Webinar Span Spectrum,
February 25, 2015: http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum/
12
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview, Q4 2014."
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework."
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop, posted by Brian Hopkins on November 26, 2014:
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
13
5 Key Takeaways
1. News: Big Data is no longer a Hadoop
monopoly.
2. Surveys: Listen to what Spark developers are
saying.
3. Vendors: <Hadoop Vendor>-tinted goggles?
FUD is still being 'offered' by some Hadoop
vendors. Claims need to be contextualized.
4. Analysts: Thorough understanding of the
market dynamics.
14
II Big Data, Typical Big Data
Stack, Hadoop, Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1 Big Data
• Big Data is still one of the most inflated buzzwords of
the last years.
• Big Data is a broad term for data sets so large or
complex that traditional data processing tools are
inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above
definition is inadequate.
• "Big Data refers to datasets and flows large enough
that they have outpaced our capability to store, process,
analyze, and understand." Amir H. Payberah,
Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data
stack.
• Hadoop ecosystem = Hadoop stack + many other tools
(either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name.
An incomplete but useful list of Big Data related projects,
packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr
Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the
video at 2:36, on 'Hadoop Isn't Just Hadoop Anymore,' for a picture
representing the evolution of Apache Hadoop:
https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list; stay tuned!
19
5 Key Takeaways
1. Big Data: Still one of the most inflated
buzzwords.
2. Typical Big Data Stack: Big Data stacks look
similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer
'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache
Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big
Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007,
MapReduce v1 was the only choice of compute model
(execution engine) on Hadoop. Now we have, in addition
to MapReduce v2: Tez, Spark, and Flink.
23
[Figure: four generations of compute models — 1st generation (MapReduce): batch; 2nd generation (Tez): batch, interactive; 3rd generation (Spark): batch, interactive, near-real-time; 4th generation (Flink): batch, interactive, real-time, iterative.]
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see slide on evolution of programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
24
1 Evolution
• Tez: Hindi for "speed."
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."
Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for
building high-performance batch and
interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1 Evolution
• 'Spark', for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy."
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs Tez vs Spark

Criteria | MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/engine/tool builder] | [Scala, Java, Python], user-facing
Libraries | None; separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs Tez vs Spark

Criteria | MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive, Pig | Difficult to program; no interactive mode except Hive, Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | to data types and data sources is the same | to data types and data sources is the same | to data types and data sources is the same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
29
Hadoop MapReduce vs Tez vs Spark

Criteria | MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | — | — | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy (partial support)
30
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can
migrate to Spark and leverage Spark Core as their
execution engine:
1. You can often reuse your mapper and
reducer functions and just call them in
Spark from Java or Scala.
2. You can translate your code from
MapReduce to Apache Spark. How-to:
Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
32
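The reuse of mapper and reducer functions can be sketched in plain Python (a toy, not the Hadoop or Spark APIs): the same word-count mapper and reducer drive both a MapReduce-shaped pipeline (map, shuffle, reduce) and a Spark-shaped aggregation.

```python
from itertools import groupby

def mapper(line):
    """Same mapper logic, reusable in either framework shape."""
    return [(word, 1) for word in line.split()]

def reducer(counts):
    return sum(counts)

lines = ["to be or", "not to be"]

# MapReduce shape: map -> shuffle (sort + group by key) -> reduce.
pairs = sorted(kv for line in lines for kv in mapper(line))
mr_result = {key: reducer([v for _, v in group])
             for key, group in groupby(pairs, key=lambda kv: kv[0])}

# Spark shape: flatMap + reduceByKey, reusing the same mapper logic.
spark_result = {}
for key, value in (kv for line in lines for kv in mapper(line)):
    spark_result[key] = spark_result.get(key, 0) + value

print(mr_result == spark_result)
```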
2 Transition
3. The following tools, originally based on Hadoop
MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration
without development effort
• Speed up your existing Pig scripts on Spark (Query,
Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as
Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test
cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality
through the community
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta;
expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on
MapReduce or Tez easily migrate to Spark without
development effort
• Exposes Spark users to a viable, feature-rich, de facto
standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries
involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta;
expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho,
Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and
Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark
(expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially
developed as a tool to transfer data from RDBMS to
Hadoop.
• The next version of Sqoop, referred to as Sqoop2,
supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under
discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (JIRA
status: work in progress). The goal of this ticket is to support a
pluggable way to select the execution engine on which we can run
Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application
development platform for building data applications on
Hadoop.
• Support for Apache Spark is on the roadmap and will be
available in the Cascading 3.1 release.
Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the
transition from Cascading/Scalding to Spark a little
easier, by adding support for Cascading Taps, Scalding
Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a
framework for writing, testing, and running
MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a
SparkPipeline class, making it easy to migrate
data processing applications from MapReduce
to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout News: 25 April 2014 - Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings Dmitriy Lyubimov April 2014 http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark Published on May 30 2014 http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
[Diagram: Hadoop services paired with the open source tools Spark integrates with - storage/serving layer, data formats, data ingestion services, resource management, search, and SQL]
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory) http://hortonworks.com/blog/ddm to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. Also supports integration of Spark Streaming with Cassandra https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDD) from Spark to Cassandra http://tuplejump.github.io/calliope
• Cassandra as a storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files
48
3 Integration
• There is also NSMC: Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose By Kenny Bastani March 10 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark By Kenny Bastani January 19 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics By Kenny Bastani November 3 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration still improving https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL February 2 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
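The schema-inference idea above can be sketched in plain Python, with no Spark required. The function below is an illustrative toy, not Spark SQL's actual algorithm: it scans every JSON record and derives a field-to-type mapping, widening to string when records disagree.

```python
import json

def infer_schema(json_lines):
    """Derive a field -> type-name mapping by scanning every record,
    mimicking how Spark SQL infers a schema from a JSON dataset."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            t = type(value).__name__
            # widen to string when two records disagree on a field's type
            if field in schema and schema[field] != t:
                schema[field] = "str"
            else:
                schema[field] = t
    return schema

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note how the `city` field, present in only one record, still lands in the schema; Spark SQL similarly unions fields across records (and marks them nullable).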
56
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL http://www.infoobjects.com/spark-sql-parquet
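The row-versus-columnar distinction that makes Parquet attractive can be shown without any Parquet library. This toy converter (an analogy only, not Parquet's on-disk format) pivots row-oriented records into column vectors, which is what lets a columnar format scan one field without touching the others:

```python
def to_columnar(rows):
    """Pivot row-oriented records into a column -> values mapping,
    the core layout idea behind a columnar format like Parquet."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

rows = [
    {"user": "alice", "bytes": 120},
    {"user": "bob", "bytes": 310},
]
cols = to_columnar(rows)
# a query like SUM(bytes) now reads only one column vector
print(sum(cols["bytes"]))  # 430
```

Real Parquet adds encodings and compression per column on top of this layout, which is why analytical queries over a few columns of a wide table are so much cheaper than in a row format.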
57
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+ https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1 http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References
• Apache Mesos vs Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics or… HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4 Complementarity +
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
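The Data >> RAM versus Data << RAM rule of thumb above amounts to a tiny dispatch policy. The sketch below is illustrative only (the threshold, cache fraction and engine names are assumptions, not from any product): it picks an engine by comparing dataset size to the cluster's usable cache memory.

```python
def pick_engine(data_size_gb, cluster_ram_gb, cache_fraction=0.6):
    """Rule-of-thumb dispatcher: Spark shines when parsed data fits in
    the cluster's cache memory; a more stream-oriented engine like Tez
    copes better when data vastly exceeds RAM."""
    usable_cache_gb = cluster_ram_gb * cache_fraction
    if data_size_gb <= usable_cache_gb:
        return "spark"  # Data << RAM: cache parsed data in memory
    return "tez"        # Data >> RAM: mature shuffling, stream oriented

print(pick_engine(data_size_gb=50, cluster_ram_gb=200))    # spark
print(pick_engine(data_size_gb=5000, cluster_ram_gb=200))  # tez
```

This is essentially what the "smart execution engine" layer discussed on the next slides automates, using data attributes and cluster condition instead of a single hard-coded threshold.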
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster
• Matt Schumpert on Datameer Smart Execution Engine http://www.infoq.com/articles/datameer-smart-execution-engine Interview on November 13 2014 with Matt Schumpert, Director of Product Management at Datameer
• The Challenge to Choosing the "Right" Execution Engine By Peter Voss | September 30 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment by Erik Halseth of Datameer on January 27th 2015 at the Los Angeles Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption February 12 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms February 23 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop March 9 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3 Integration: a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon http://sparkbigdata.com/component/tags/tag/13
4 Use a Non-HDFS file system already supported by Spark:
• Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS July 11 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage March 9 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) - http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) - http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant March 4 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014 July 2 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra Piotr Kolaczkowski September 26 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector Helena Edelson published on November 24 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready http://www.stratio.com
• Streaming-CEP-Engine: Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos September 25 2014 by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Hadoop Ecosystem → Spark Ecosystem
Component:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
90
Spark Native API
• Spark Native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup May 28 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
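The "mix and match SQL and imperative code" idea can be illustrated with Python's built-in sqlite3 standing in for the SQL engine. This is an analogy only: Spark SQL runs the same pattern distributed over RDDs/DataFrames, and the table and column names here are invented for the example.

```python
import sqlite3

# SQL side: a declarative query over structured data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (user TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("alice", 120), ("bob", 310), ("alice", 50)])
rows = conn.execute(
    "SELECT user, SUM(bytes) FROM logs GROUP BY user ORDER BY user"
).fetchall()

# imperative side: post-process the query result in ordinary code
heavy_users = [user for user, total in rows if total > 100]
print(heavy_users)  # ['alice', 'bob']
```

In Spark SQL the result of the SQL step is itself a distributed dataset, so the imperative step (a `filter`, a `map`, an MLlib call) runs in parallel on the cluster rather than on a collected list.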
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing Model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance – every record processed | At least once (may be duplicates) | Exactly once
Batch Framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
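The processing-model row of the table – record-at-a-time versus mini-batches – can be sketched in a few lines of plain Python. This is a conceptual toy: Spark Streaming cuts batches by a time interval, while a count stands in for that interval here.

```python
def mini_batches(stream, batch_size):
    """Group an unbounded record stream into small batches, the way
    Spark Streaming discretizes a stream into one RDD per interval
    (a count stands in for the time interval here)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch  # Spark-style: process a whole batch at once
            batch = []
    if batch:
        yield batch      # flush the final partial batch

# Storm-style would invoke a callback once per record; here one call
# handles each batch, which is what enables reuse of the batch API
events = ["e1", "e2", "e3", "e4", "e5"]
print(list(mini_batches(events, batch_size=2)))
# [['e1', 'e2'], ['e3', 'e4'], ['e5']]
```

Batching is also why Spark Streaming's latency is "few seconds" rather than sub-second: a record waits for its batch boundary before being processed.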
95
GraphX
96
'GraphX' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin http://zeppelin-project.org is a web-based notebook that enables interactive data analytics. Has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file system agnostic. Bring Your Own Storage
2 Deployment: Spark is cluster infrastructure agnostic. Choose your deployment
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
4 Analysts
• Is Apache Spark replacing Hadoop or complementing existing Hadoop practice?
• Both are already happening:
• With uncertainty about "what is Hadoop", there is no reason to think solution stacks built on Spark, not positioned as Hadoop, will not continue to proliferate as the technology matures
• At the same time, Hadoop distributions are all embracing Spark and including it in their offerings
Source: Hadoop Questions from Recent Webinar Span Spectrum February 25 2015 http://blogs.gartner.com/merv-adrian/2015/02/25/hadoop-questions-from-recent-webinar-span-spectrum
12
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report: The Hadoop Ecosystem Overview Q4 2014"
• "For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework"
• "We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in Hadoop File System, and an extended group of components that leverage but do not require it" Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop Posted by Brian Hopkins on November 26 2014 http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
13
5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly
2 Surveys: listen to what Spark developers are saying
3 Vendors: <Hadoop vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized
4 Analysts: thorough understanding of the market dynamics
14
II Big Data Typical Big Data
Stack Hadoop Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1 Big Data
• Big Data is still one of the most inflated buzzwords of the last years
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate
• "Big Data refers to datasets and flows large enough that [it] has outpaced our capability to store, process, analyze and understand" Amir H. Payberah, Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3 Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones)
• Big Data Ecosystem Dataset http://bigdata.andreamostosi.name Incomplete but useful list of Big Data related projects, packed into a JSON dataset
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015) February 19 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4 Apache Spark
• Apache Spark as an example of a Typical Big Data Stack
• Apache Spark provides you Big Data computing and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning) http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS (the Berkeley Data Analytics Stack) and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
19
5 Key Takeaways
1 Big Data: still one of the most inflated buzzwords
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data
4 Apache Spark: emergence of the Apache Spark ecosystem
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data http://wiki.apache.org/hadoop/WordCount
• Pig http://pig.apache.org
• Hive http://hive.apache.org
• Scoobi: a Scala productivity framework for Hadoop https://github.com/NICTA/scoobi
• Cascading http://www.cascading.org
• Scalding: a Scala API for Cascading http://twitter.com/scalding
• Crunch http://crunch.apache.org
• Scrunch http://crunch.apache.org/scrunch.html
22
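To make the "assembly code" comparison concrete, here is a hypothetical pure-Python sketch (no Hadoop or Spark required) contrasting a word count written in explicit map/shuffle/reduce phases with the one-expression style that the higher-level APIs above let you write:

```python
from collections import Counter, defaultdict
from functools import reduce

lines = ["spark and hadoop", "spark or hadoop"]

# MapReduce style: explicit map, shuffle (group by key) and reduce phases.
mapped = [(word, 1) for line in lines for word in line.split()]
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)
reduced = {word: reduce(lambda a, b: a + b, counts)
           for word, counts in shuffled.items()}

# High-level style, roughly what Pig, Hive, Scalding or Spark let you express directly.
concise = Counter(word for line in lines for word in line.split())

assert reduced == dict(concise)
```

Both produce the same counts; the difference is how much plumbing the programmer writes by hand.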
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
23
• 1st generation (MapReduce v1): Batch
• 2nd generation (MapReduce v2): Batch
• 3rd generation (Tez): Batch, Interactive
• 4th generation (Spark, Flink): Batch, Interactive, Near-Real-Time / Real-Time, Iterative
1. Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."
Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
26
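To illustrate the "core capability" just described, here is a toy pure-Python model (not Spark's actual API) of what an RDD is conceptually: a lazy lineage of transformations that is only computed when results are requested, and that can be cached in memory after the first computation:

```python
class MiniRDD:
    """Toy model of an RDD: a lazy lineage of transformations (illustrative only)."""
    def __init__(self, compute):
        self._compute = compute   # function producing the data when needed
        self._cache = None        # filled on first collect() after cache()
        self._cached = False

    def map(self, f):
        # No work happens here; we only extend the lineage.
        return MiniRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._cached = True
        return self

    def collect(self):
        # Materialize the lineage; serve from memory on later calls if cached.
        if self._cached:
            if self._cache is None:
                self._cache = self._compute()
            return self._cache
        return self._compute()

rdd = MiniRDD(lambda: list(range(5))).map(lambda x: x * x).filter(lambda x: x > 3).cache()
assert rdd.collect() == [4, 9, 16]  # computed once, then served from memory
```

Real Spark adds partitioning, fault tolerance via lineage replay, and distribution across a cluster, but the lazy-then-cache shape is the same.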
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive | In-memory and on-disk; Batch, Interactive, Streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV / engine / tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | | | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy (partial support)
30
III. Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: 'How-to: Translate from MapReduce to Apache Spark': http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
32
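Point 1 above, reusing existing mapper and reducer functions, can be sketched in plain Python (no Spark required; the pipeline stands in for Spark's flatMap/groupByKey/map chain):

```python
from itertools import groupby

# Pre-existing "MapReduce-era" functions, kept unchanged.
def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return word, sum(counts)

lines = ["to spark", "to hadoop", "to spark"]

# Reused inside a flatMap / groupByKey-style pipeline, the way one
# would call them from a Spark job in Java or Scala.
pairs = sorted(kv for line in lines for kv in mapper(line))
result = dict(reducer(word, (c for _, c in group))
              for word, group in groupby(pairs, key=lambda kv: kv[0]))

assert result == {"to": 3, "spark": 2, "hadoop": 1}
```

The migration cost is mostly in the driver plumbing, not in the per-record logic.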
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• 'Hive on Spark', February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• 'Hive on Spark is blazing fast, or is it?', Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (a.k.a. "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• 'Sqoop2: Support Sqoop on Spark Execution Engine' (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
37
(Expected in the Cascading 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release.
Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• 'Mahout Scala and Spark bindings', Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• 'Co-occurrence Based Recommendations with Mahout, Scala and Spark', published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3. Integration
(Slide: open source tools integrating with Spark, by service layer.)
• Storage/serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
43
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• 'Kindling: An Introduction to Spark with Cassandra (Part 1)': http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• 'MongoDB and Hadoop: Driving Business Insights': http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• 'Using MongoDB with Hadoop & Spark':
• Part 1 (introduction and setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark example and key takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• 'Getting Started with Apache Spark and Neo4j Using Docker Compose', Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• 'Categorical PageRank Using Neo4j and Apache Spark', Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• 'Using Apache Spark and Neo4j for Big Data Graph Analytics', Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open YARN-related Spark issues in Jira (project = SPARK AND summary ~ yarn AND status = OPEN).
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• 'Get the most out of Spark on YARN': https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: 'What's Coming in 2015 for Drill': http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• 'Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game': http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• 'An introduction to JSON support in Spark SQL', February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
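The schema-inference idea above can be illustrated with a toy pure-Python sketch (not Spark SQL itself): scan JSON records and derive a field-to-type mapping before querying, which is conceptually what Spark SQL does when it loads JSON as a SchemaRDD:

```python
import json

records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]

# Infer a simple schema: field -> set of observed types. Records may
# contribute different fields, so the schema is the union across records.
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, set()).add(type(value).__name__)

assert schema == {"name": {"str"}, "age": {"int"}, "city": {"str"}}
```

Spark SQL's real inference additionally merges conflicting types and handles nested structures, but the "scan, then union field types" shape is the same.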
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
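Why a columnar format like Parquet helps analytical queries can be shown with a small pure-Python sketch (illustrative only, not the Parquet file format): the same records laid out row-wise and column-wise, where an aggregate over one field only has to touch one column:

```python
# Row-oriented layout: each record stored together (good for point lookups).
rows = [
    {"user": "a", "bytes": 10},
    {"user": "b", "bytes": 20},
    {"user": "c", "bytes": 30},
]

# Column-oriented layout (the Parquet idea): each field stored contiguously,
# so an aggregate scans only the one column it needs, and the column
# compresses well because its values are homogeneous.
columns = {
    "user": [r["user"] for r in rows],
    "bytes": [r["bytes"] for r in rows],
}

total = sum(columns["bytes"])  # touches a single column, not whole records
assert total == 60
```

On disk, Parquet adds encodings, compression and per-column statistics on top of this layout, which is what Spark SQL exploits when reading Parquet files.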
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• 'Ingesting HDFS data into Solr using Spark': http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• Hue is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• 'Big Data Web applications for Interactive Hadoop': https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• 'The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark' (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• 'Spark and in-memory databases: Tachyon leading the pack' (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity
References:
• 'Apache Mesos vs. Apache Hadoop YARN': https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad, a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• 'Myriad Project Marries YARN and Apache Mesos Resource Management': http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• 'YARN vs. MESOS: Can't We All Just Get Along?': http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• 'Improving Spark for Data Pipelines with Native YARN Integration': http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• 'Get the most out of Spark on YARN': https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer).
• 'The Challenge to Choosing the "Right" Execution Engine', by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• 'Operating in a Multi-execution Engine Hadoop Environment', by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group:
http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• 'New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption', February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• 'Syncsort Automates Data Migrations Across Multiple Platforms', February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• 'Framework for the Future of Hadoop', March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. 'Because Hadoop isn't perfect: 8 ways to replace HDFS', July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• 'Databricks Cloud: From raw data to insights and data products in an instant', March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• 'Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra', Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• 'Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector', Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• 'Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos', September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives
Hadoop ecosystem component → Spark ecosystem alternative:
• HDFS → Tachyon
• YARN → Mesos
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software
88
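Because Tachyon exposes a Hadoop-compatible file system, pointing existing Spark code at it is only a change of URL scheme. A minimal sketch (the master address and paths are hypothetical; requires a Spark installation and a running Tachyon master):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TachyonExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TachyonExample"))
    // Reading from Tachyon only requires the tachyon:// URL scheme;
    // the RDD code is identical to the HDFS version.
    val lines = sc.textFile("tachyon://localhost:19998/input/data.txt")
    lines.saveAsTextFile("tachyon://localhost:19998/output")
    sc.stop()
  }
}
```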
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
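The conciseness of the native API is easiest to see in the classic word count. A sketch (input and output paths are passed as arguments; requires a Spark installation):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile(args(0))   // any Hadoop-supported URI
      .flatMap(_.split("\\s+"))         // split lines into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)               // sum the counts per word
    counts.saveAsTextFile(args(1))
    sc.stop()
  }
}
```

The same pipeline takes dozens of lines as a hand-written MapReduce job in Java.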
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
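The "mix and match" point can be sketched against the Spark 1.2-era API, where an RDD of case classes becomes a queryable table and the SQL result is again an RDD (the `Person` class and data are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object MixedSqlExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MixedSqlExample"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD (Spark 1.2)

    val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 15)))
    people.registerTempTable("people")
    // Declarative SQL, followed by imperative RDD operations on the result.
    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
    adults.map(row => "Adult: " + row.getString(0)).collect().foreach(println)
    sc.stop()
  }
}
```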
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
95
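The mini-batch model in the table is visible in a small Spark Streaming sketch: each batch interval produces an RDD that is processed with the same core API (the host/port source is hypothetical; requires a Spark installation):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingWordCount")
    // Every 2-second batch becomes a small RDD: this is the
    // "mini batches" processing model from the comparison table.
    val ssc = new StreamingContext(conf, Seconds(2))
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```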
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file system agnostic. Bring Your Own Storage!
2 Deployment: Spark is cluster infrastructure agnostic. Choose your deployment!
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud or embedded in non-Hadoop distributions are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
4 Analysts
• "After hearing the confusion between Spark and Hadoop one too many times, I was inspired to write a report, The Hadoop Ecosystem Overview, Q4 2014.
• For those that have day jobs that don't include constantly tracking Hadoop evolution, I dove in and worked with Hadoop vendors and trusted consultants to create a framework.
• We divided the complex Hadoop ecosystem into a core set of tools that all work closely with data stored in the Hadoop File System, and an extended group of components that leverage but do not require it." Source: Elephants, Pigs, Rhinos and Giraphs; Oh My! It's Time To Get A Handle On Hadoop. Posted by Brian Hopkins on November 26, 2014.
http://blogs.forrester.com/brian_hopkins/14-11-26-elephants_pigs_rhinos_and_giraphs_oh_my_its_time_to_get_a_handle_on_hadoop
13
5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly.
2 Surveys: Listen to what Spark developers are saying.
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized.
4 Analysts: Thorough understanding of the market dynamics.
14
II Big Data Typical Big Data
Stack Hadoop Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that they have outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3 Apache Hadoop
• Apache Hadoop, as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4 Apache Spark
• Apache Spark, as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!
19
5 Key Takeaways
1 Big Data: Still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: Emergence of the Apache Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
23
• 1st Generation: Batch
• 2nd Generation: Batch, Interactive
• 3rd Generation: Batch, Interactive, Near-Real-Time
• 4th Generation: Batch, Interactive, Real-Time, Iterative
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see slide on evolution of programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.
24
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs Tez vs Spark

Criteria            | Hadoop MapReduce                            | Tez                                 | Spark
License             | Open Source Apache 2.0, version 2.x         | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization), batch | On-disk; batch, interactive         | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                                        | Java                                | Scala
API                 | [Java, Python, Scala], user-facing          | Java [ISV/engine/tool builder]      | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                        | None                                | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs Tez vs Spark

Criteria         | Hadoop MapReduce                                                                  | Tez                                                          | Spark
Installation     | Bound to Hadoop                                                                   | Bound to Hadoop                                              | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode (except Hive, Pig)  | Difficult to program; no interactive mode (except Hive, Pig) | Easy to program, no need for abstractions; interactive mode
Compatibility    | Same for data types and data sources                                              | Same for data types and data sources                         | Same for data types and data sources
YARN integration | YARN application                                                                  | Ground-up YARN application                                   | Spark is moving towards YARN
29
Hadoop MapReduce vs Tez vs Spark

Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy (partial support)
30
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
32
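The first migration path can be sketched as follows: the map and reduce logic lifted from an existing job stays as plain functions and is simply called from Spark transformations (function and path names here are hypothetical; requires a Spark installation):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MigratedJob {
  // Logic taken from an existing mapper and reducer, kept as pure functions.
  def mapLogic(line: String): Seq[(String, Int)] =
    line.split("\\s+").map(word => (word, 1))

  def reduceLogic(a: Int, b: Int): Int = a + b

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MigratedJob"))
    sc.textFile(args(0))
      .flatMap(mapLogic)          // reuse the mapper logic
      .reduceByKey(reduceLogic)   // reuse the reducer logic
      .saveAsTextFile(args(1))
    sc.stop()
  }
}
```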
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella Jira (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka 'from SQL to Hadoop') was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
(Slide graphic: open source tools that integrate with Spark, organized by service: storage/serving layer, data formats, data ingestion services, resource management, search, and SQL.)
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
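The newAPIHadoopRDD route mentioned above can be sketched as follows, in the spirit of HBaseTest.scala (the table name is hypothetical; requires Spark, the HBase client jars, and a running HBase):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object HBaseRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseRead"))
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table
    // Each HBase row becomes one (row key, Result) element of the RDD.
    val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])
    println("Rows in table: " + rows.count())
    sc.stop()
  }
}
```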
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
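Both directions of the Spark Cassandra Connector can be sketched in a few lines (keyspace, table and column names are hypothetical; requires Spark, the spark-cassandra-connector jar, and a running Cassandra node):

```scala
import com.datastax.spark.connector._ // adds cassandraTable / saveToCassandra
import org.apache.spark.{SparkConf, SparkContext}

object CassandraExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CassandraExample")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)
    // Expose a Cassandra table as an RDD ...
    val rdd = sc.cassandraTable("test_keyspace", "words")
    println("Rows: " + rdd.count())
    // ... and write an RDD back to another table.
    sc.parallelize(Seq(("spark", 1), ("hadoop", 2)))
      .saveToCassandra("test_keyspace", "word_counts", SomeColumns("word", "count"))
    sc.stop()
  }
}
```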
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: Spark-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
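Querying an existing Hive table from Spark SQL goes through HiveContext; a sketch (the table and columns are hypothetical; requires Spark built with Hive support and a configured metastore):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveExample"))
    val hiveContext = new HiveContext(sc)
    // Query a table registered in the Hive metastore; the result is an
    // RDD of rows that can be fed to MLlib or any other Spark library.
    val sales = hiveContext.sql("SELECT product, amount FROM sales")
    sales.take(10).foreach(println)
    sc.stop()
  }
}
```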
3 Integration
bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to
address new use cases
bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query
in-memory data in Spark Embed Drill execution in a
Spark data pipeline
Source Whats Coming in 2015 for
Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
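The native Kafka integration can be sketched with the receiver-based API of the era (ZooKeeper address, consumer group and topic are hypothetical; requires Spark, spark-streaming-kafka, and running Kafka/ZooKeeper):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))
    // Receiver-based stream: (ZooKeeper quorum, consumer group, topic -> threads).
    val messages = KafkaUtils.createStream(
      ssc, "localhost:2181", "spark-group", Map("events" -> 1))
    messages.map(_._2)          // drop the key, keep the message payload
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```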
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
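The "no more DDL" point can be sketched as follows: the schema is inferred directly from the JSON records and the data is immediately queryable (the file and column names are hypothetical; requires a Spark installation):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonInference {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JsonInference"))
    val sqlContext = new SQLContext(sc)
    // The schema is inferred from the JSON records; no DDL needed.
    val events = sqlContext.jsonFile("events.json") // hypothetical input file
    events.printSchema()
    events.registerTempTable("events")
    sqlContext.sql("SELECT eventType, COUNT(*) FROM events GROUP BY eventType")
      .collect().foreach(println)
    sc.stop()
  }
}
```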
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
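A round trip through Parquet, with the Spark 1.2-era SchemaRDD API, can be sketched as (file names are hypothetical; requires a Spark installation):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetExample"))
    val sqlContext = new SQLContext(sc)
    // Write a SchemaRDD out as Parquet, read it back, and query it.
    val people = sqlContext.jsonFile("people.json") // hypothetical input file
    people.saveAsParquetFile("people.parquet")
    val loaded = sqlContext.parquetFile("people.parquet")
    loaded.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people").collect().foreach(println)
    sc.stop()
  }
}
```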
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
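The "any RDD can be saved to Elasticsearch" point from elasticsearch-hadoop can be sketched as (node address and index/type target are hypothetical; requires Spark, the elasticsearch-spark jar, and a running Elasticsearch node):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._ // adds saveToEs to RDDs

object EsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("EsExample")
      .set("es.nodes", "localhost") // hypothetical Elasticsearch node
    val sc = new SparkContext(conf)
    // Any RDD whose elements translate into documents can be indexed;
    // here, each Map becomes one document.
    val docs = sc.parallelize(Seq(
      Map("title" -> "Spark with Hadoop", "views" -> 10),
      Map("title" -> "Spark without Hadoop", "views" -> 7)))
    docs.saveToEs("talks/slides") // "index/type" target
    sc.stop()
  }
}
```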
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: 'CrunchIndexerTool on Spark'.
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity: YARN + Mesos
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
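For option 4 above, pointing Spark at S3 instead of HDFS is mostly a matter of the URI scheme plus credentials. A sketch (Spark 1.x s3n scheme assumed; the bucket name is a placeholder and credentials are read from the environment):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("S3Sketch"))
// Credentials can also come from core-site.xml instead of environment variables
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
// Read directly from a bucket; no HDFS involved
val logs = sc.textFile("s3n://my-bucket/logs/*.gz")
println(logs.count())
```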
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
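The deployment choice above is mostly a matter of the master URL handed to SparkConf. A sketch of the common forms (host names are placeholders):

```scala
import org.apache.spark.SparkConf

// Local mode: run with 4 worker threads inside a single JVM
val localConf = new SparkConf().setAppName("app").setMaster("local[4]")
// Standalone cluster: point at the Spark master
val standaloneConf = new SparkConf().setAppName("app").setMaster("spark://master-host:7077")
// Mesos cluster: point at the Mesos master
val mesosConf = new SparkConf().setAppName("app").setMaster("mesos://mesos-host:5050")
// On YARN (Spark 1.x), the master is simply "yarn-client" or "yarn-cluster"
```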
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives

             Hadoop Ecosystem    Spark Ecosystem
Components:  HDFS                Tachyon
             YARN                Mesos
Tools:       Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
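Because Tachyon speaks the Hadoop file-system API, a Spark job only needs a tachyon:// URI to use it. A sketch (assuming an existing SparkContext sc; the master host and paths are placeholders):

```scala
// Read from and write to Tachyon exactly as with HDFS
val data = sc.textFile("tachyon://tachyon-master:19998/input")
val upper = data.map(_.toUpperCase)
upper.saveAsTextFile("tachyon://tachyon-master:19998/output")
// In Spark 1.x, with spark.tachyonStore configured, RDDs can also be
// persisted off-heap into Tachyon
upper.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
```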
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as the data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos

Criteria          YARN                      Mesos
Resource sharing  Yes                       Yes
Written in        Java                      C++
Scheduling        Memory only               CPU and Memory
Running tasks     Unix processes            Linux Container groups
Requests          Specific requests and     More generic, but more coding
                  locality preference       for writing frameworks
Maturity          Less mature               Relatively more mature
90
Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
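As a taste of the conciseness the slide refers to, word count in the Scala API fits in a few lines (assuming an existing SparkContext sc; the input path is a placeholder):

```scala
val counts = sc.textFile("hdfs:///input/docs")  // placeholder path
  .flatMap(_.split("\\s+"))                     // map phase: emit words
  .map(word => (word, 1))
  .reduceByKey(_ + _)                           // reduce phase: sum per word
counts.take(10).foreach(println)
```

In the interactive spark-shell, the same lines can be typed and evaluated one at a time.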
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
92
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming

Criteria                  Storm                  Spark Streaming
Processing model          Record at a time       Mini batches
Latency                   Sub-second             Few seconds
Fault tolerance           At least once          Exactly once
(every record processed)  (may be duplicates)
Batch framework           Not available          Core Spark API
integration
Supported languages       Any programming        Scala, Java, Python
                          language
95
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring Your Own Storage
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
5 Key Takeaways
1 News: Big Data is no longer a Hadoop monopoly
2 Surveys: listen to what Spark developers are saying
3 Vendors: <Hadoop Vendor>-tinted goggles? FUD is still being 'offered' by some Hadoop vendors. Claims need to be contextualized
4 Analysts: a thorough understanding of the market dynamics
14
II Big Data Typical Big Data
Stack Hadoop Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1 Big Data
• Big Data is still one of the most inflated buzzwords of recent years
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate: http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool, so the above definition is inadequate
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze and understand" – Amir H. Payberah, Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones)
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name – an incomplete but useful list of Big-Data-related projects packed into a JSON dataset
• Hadoop's Impact on Data Management's Future - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!
19
5 Key Takeaways
1 Big Data: still one of the most inflated buzzwords
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data
4 Apache Spark: emergence of the Apache Spark ecosystem
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
23
• 1st Generation (MapReduce): Batch
• 2nd Generation (Tez): Batch, Interactive
• 3rd Generation (Spark): Batch, Interactive, Near-Real-Time
• 4th Generation (Flink): Batch, Interactive, Real-Time, Iterative
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets" http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of Programming APIs), User Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
• The need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop
25
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark
26
1 Evolution Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                 Tez                     Spark
License           Open Source               Open Source             Open Source
                  Apache 2.0, version 2.x   Apache 2.0,             Apache 2.0,
                                            version 0.x             version 1.x
Processing model  On-disk (disk-based       On-disk; Batch,         In-memory, on-disk;
                  parallelization); Batch   Interactive             Batch, Interactive,
                                                                    Streaming (near
                                                                    real-time)
Written in        Java                      Java                    Scala
API               [Java, Python, Scala],    Java,                   [Scala, Java, Python],
                  user-facing               [ISV/Engine/Tool        user-facing
                                            builder]
Libraries         None, separate tools      None                    [Spark Core, Spark
                                                                    Streaming, Spark SQL,
                                                                    MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria          MapReduce                 Tez                     Spark
Installation      Bound to Hadoop           Bound to Hadoop         Isn't bound to Hadoop
Ease of use       Difficult to program,     Difficult to program;   Easy to program,
                  needs abstractions; no    no interactive mode     no need of
                  interactive mode except   except Hive, Pig        abstractions;
                  Hive, Pig                                         interactive mode
Compatibility     Same for data types       Same for data types     Same for data types
                  and data sources          and data sources        and data sources
YARN integration  YARN application          Ground-up YARN          Spark is moving
                                            application             towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria     MapReduce              Tez                    Spark
Deployment   YARN                   YARN                   [Standalone, YARN*,
                                                           SIMR, Mesos, ...]
Performance  –                      –                      - Good performance when
                                                           data fits into memory
                                                           - Performance degradation
                                                           otherwise
Security     More features and      More features and      Still in its infancy
             projects               projects
30
* Partial support
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
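Point 1 above can be sketched as follows: keep the map and reduce logic as plain functions and call them from Spark (word-count logic chosen for illustration; sc is an existing SparkContext and paths are placeholders):

```scala
// Logic once embedded in a Hadoop Mapper/Reducer, now plain functions
def mapper(line: String): Seq[(String, Int)] =
  line.split(" ").map(word => (word, 1))

def reducer(a: Int, b: Int): Int = a + b

// The same functions plugged into the Spark API
val counts = sc.textFile("hdfs:///input")  // placeholder path
  .flatMap(mapper)
  .reduceByKey(reducer)
counts.saveAsTextFile("hdfs:///output")
```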
32
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
(Figure: open source tools from the Hadoop ecosystem that integrate with Spark, grouped by service: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL)
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase, without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
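The newAPIHadoopRDD route mentioned above looks roughly like this, mirroring Spark's HBaseTest.scala (sc is an existing SparkContext; the table name is a placeholder):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "users") // placeholder table name
// Expose the HBase table as an RDD of (row key, row) pairs
val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])
println(hbaseRDD.count())
```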
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
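With the Spark Cassandra Connector on the classpath, reading and writing Cassandra from Spark is a one-liner each. A sketch (sc is an existing SparkContext; the keyspace, tables, and columns are placeholders):

```scala
import com.datastax.spark.connector._

// Expose a Cassandra table as an RDD of rows
val users = sc.cassandraTable("my_keyspace", "users")
println(users.count())

// Save a collection of tuples back to a (pre-created) table
val scores = sc.parallelize(Seq(("alice", 10), ("bob", 7)))
scores.saveToCassandra("my_keyspace", "scores", SomeColumns("name", "score"))
```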
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files
48
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0, SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
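The Hive bullets above map onto HiveContext in the Spark 1.2-era API. A sketch (sc is an existing SparkContext; table names and queries are placeholders):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
// Import relational data from a Hive table and query it
val active = hiveCtx.sql("SELECT name, visits FROM users WHERE visits > 10")
// The result is a SchemaRDD: normal RDD operations apply
active.map(row => row.getString(0)).take(5).foreach(println)
// Write results back out to a Hive table
hiveCtx.sql(
  "CREATE TABLE IF NOT EXISTS active_users AS SELECT * FROM users WHERE visits > 10")
```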
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
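A minimal receiver-based sketch of the native integration (Spark 1.x KafkaUtils; sc is an existing SparkContext, and the ZooKeeper address, consumer group, and topic are placeholders):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(2))
// Subscribe to topic "events" with one receiver thread
val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("events" -> 1))
// Each record is a (key, message) pair; count messages per 2-second batch
stream.map(_._2).count().print()
ssc.start()
ssc.awaitTermination()
```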
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
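Schema inference as described above, in the Spark 1.2 API (sc is an existing SparkContext; the file name and fields are placeholders):

```scala
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)
// The schema is inferred automatically from the JSON records
val people = sqlCtx.jsonFile("people.json")
people.printSchema()
// Register the inferred SchemaRDD and query it with SQL
people.registerTempTable("people")
sqlCtx.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
```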
56
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity +
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine — interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group:
http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015:
http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015:
http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015:
http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5 Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: A healthy dose of Hadoop ecosystem integration with Spark already; more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
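From the application's point of view, these deployment modes differ mostly in the `--master` URL passed to `spark-submit`. A hedged sketch of the invocations (`app.py`, host names, and ports are placeholders, not from the slides):

```shell
# Same application, different cluster managers; only --master changes.
spark-submit --master local[4]          app.py   # local mode, 4 cores
spark-submit --master spark://host:7077 app.py   # standalone cluster
spark-submit --master mesos://host:5050 app.py   # Apache Mesos
spark-submit --master yarn-cluster      app.py   # Hadoop YARN
```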
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Hadoop Ecosystem → Spark Ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: Datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
90
Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
95
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
II Big Data Typical Big Data
Stack Hadoop Spark
1 Big Data
2 Typical Big Data Stack
3 Apache Hadoop
4 Apache Spark
5 Key Takeaways
15
1 Big Data
• Big Data is still one of the most inflated buzzwords of the last years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that it has outpaced our capability to store, process, analyze, and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name — an incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future – Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop: https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!
19
5 Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi: A Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding: A Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink.
23
• 1st Generation (MapReduce): Batch
• 2nd Generation (Tez): Batch, Interactive
• 3rd Generation (Spark): Batch, Interactive, Near-Real-Time
• 4th Generation (Flink): Batch, Interactive, Real-Time, Iterative
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
24
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
1 Evolution Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark
Criteria | MapReduce | Tez | Spark
License | Open Source Apache 2.0, version 2.x | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization); Batch | On-disk; Batch, Interactive | In-memory and on-disk; Batch, Interactive, Streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | Java [ISV/Engine/Tool builder] | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria | MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | Same for data types and data sources | Same for data types and data sources | Same for data types and data sources
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria | MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | – | – | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy (partial support)
30
IV Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
32
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "–x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (Currently in Beta
Expected in Hive 110)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (Currently in Beta
Expected in Hive 110)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark
(Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
(Expected in 31 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014 – Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 10 )
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
Service | Open Source Tool (the tool logos appear in the original slide):
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM: Discardable Distributed Memory (http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• Cassandra as a storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).
• Integration is still improving, and some open issues are critical ones. JIRA search: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC, at https://issues.apache.org/jira
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
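The automatic schema inference described above can be mimicked in plain Python to show the idea (a simplified sketch; Spark SQL's real inference also reconciles conflicting types across records and handles nested structures):

```python
import json

def infer_schema(json_lines):
    """Infer field names and Python type names from newline-delimited JSON."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            # First occurrence of a field decides its type in this toy version.
            schema.setdefault(field, type(value).__name__)
    return schema

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 29, "city": "Chicago"}',
]
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note how the `city` field appears only in the second record yet still enters the schema: that union-of-fields behavior is what spares you from writing DDL up front.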
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
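Why a columnar format pays off can be shown with a plain-Python toy model (an illustration of row versus column layout, not the Parquet format itself):

```python
# Row-oriented layout: every whole record is read even if only one field is needed.
rows = [
    {"user": "alice", "bytes": 120, "country": "US"},
    {"user": "bob",   "bytes": 80,  "country": "DE"},
    {"user": "carol", "bytes": 200, "country": "US"},
]

# Column-oriented layout: each field is stored contiguously, so scanning a
# single column skips all unrelated data (and compresses better, since the
# values in one column are homogeneous).
columns = {
    "user":    ["alice", "bob", "carol"],
    "bytes":   [120, 80, 200],
    "country": ["US", "DE", "US"],
}

total_row = sum(r["bytes"] for r in rows)  # touches every record in full
total_col = sum(columns["bytes"])          # touches exactly one column
print(total_row, total_col)  # 400 400
```

Both layouts give the same answer; the difference is how much data an analytical query has to touch to get it.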
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrates ingestion of HDFS data into Solr from MapReduce to Spark
  • Updates and deletes existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• Hue is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
[Diagram: Hadoop ecosystem and Spark ecosystem working together]
4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
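The "Data << RAM" case can be illustrated with a small plain-Python caching sketch (hypothetical parse function and counts, standing in for the reuse that rdd.cache() gives Spark over a re-reading, disk-oriented engine):

```python
parse_calls = 0

def parse(raw):
    """Simulate an expensive parse step; count how often it actually runs."""
    global parse_calls
    parse_calls += 1
    return [int(x) for x in raw.split(",")]

raw_data = "1,2,3,4,5"

# Without caching (disk-oriented style): every pass re-parses the input.
for _ in range(3):
    total = sum(parse(raw_data))
calls_without_cache = parse_calls

# With caching (Spark-style): parse once, keep the result in memory, reuse it.
parse_calls = 0
cached = parse(raw_data)
for _ in range(3):
    total = sum(cached)

print(calls_without_cache, parse_calls)  # 3 1
```

Three passes cost three parses without the cache and only one with it; for iterative workloads (machine learning, interactive queries) that reuse factor is where the in-memory speedup comes from.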
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives

              Hadoop Ecosystem      Spark Ecosystem
Components    HDFS                  Tachyon
              YARN                  Mesos
Tools         Pig                   Spark native API
              Hive                  Spark SQL
              Mahout                MLlib
              Storm                 Spark Streaming
              Giraph                GraphX
              HUE                   Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos

Criteria           YARN                      Mesos
Resource sharing   Yes                       Yes
Written in         Java                      C++
Scheduling         Memory only               CPU and Memory
Running tasks      Unix processes            Linux Container groups
Requests           Specific requests and     More generic, but more coding
                   locality preference       for writing frameworks
Maturity           Less mature               Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
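The "mix and match SQL and imperative APIs" pattern can be illustrated with plain Python and the standard library's sqlite3 standing in for Spark SQL (the pattern only, not the Spark API):

```python
import sqlite3

# Declarative step: filter and aggregate with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("alice", 10), ("bob", 5), ("alice", 7)])
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the SQL result with ordinary code.
top = max(rows, key=lambda r: r[1])
print(rows)  # [('alice', 17), ('bob', 5)]
print(top)   # ('alice', 17)
```

In Spark SQL the same back-and-forth happens in one program: a SQL query produces a SchemaRDD/DataFrame, and the result flows straight into RDD transformations or MLlib without leaving the application.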
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming

Criteria                    Storm                     Spark Streaming
Processing model            Record at a time          Mini batches
Latency                     Sub-second                Few seconds
Fault tolerance (every      At least once (may        Exactly once
record processed)           be duplicates)
Batch framework             Not available             Core Spark API
integration
Supported languages         Any programming           Scala, Java, Python
                            language
95
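The table's central distinction, record-at-a-time versus mini batches, can be sketched in plain Python (a toy model of the two styles, not either engine's API):

```python
def process_record_at_a_time(stream, handle):
    """Storm-style: each record is handled the moment it arrives."""
    return [handle(record) for record in stream]

def process_in_mini_batches(stream, handle_batch, batch_size):
    """Spark Streaming-style: records are grouped into small batches,
    and each batch is handed to the batch engine as one unit."""
    results = []
    for i in range(0, len(stream), batch_size):
        results.append(handle_batch(stream[i:i + batch_size]))
    return results

stream = [1, 2, 3, 4, 5, 6]
print(process_record_at_a_time(stream, lambda r: r * 2))   # [2, 4, 6, 8, 10, 12]
print(process_in_mini_batches(stream, sum, batch_size=2))  # [3, 7, 11]
```

Batching is what buys Spark Streaming exactly-once semantics and reuse of the core batch API, at the cost of the few-seconds latency shown in the table.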
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
1 Big Data
• Big Data is still one of the most inflated buzzwords of the last years.
• Big Data is a broad term for data sets so large or complex that traditional data processing tools are inadequate. http://en.wikipedia.org/wiki/Big_data
• Hadoop is becoming a traditional tool; the above definition is inadequate.
• "Big Data refers to datasets and flows large enough that has outpaced our capability to store, process, analyze and understand." Amir H. Payberah, Swedish Institute of Computer Science (SICS)
16
2 Typical Big Data Stack
17
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack.
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack.
• Apache Spark provides you Big Data computing and more:
  • BYOS: Bring Your Own Storage
  • BYOC: Bring Your Own Cluster
  • Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
  • Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
  • Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
  • MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
  • GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list; stay tuned!
19
5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi: a Scala productivity framework for Hadoop. https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding: a Scala API for Cascading. http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
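"MapReduce in Java is like assembly code": even a word count makes you spell out the map, shuffle and reduce phases by hand. A plain-Python rendering of those three phases (the structure of the model, not Hadoop's API):

```python
from itertools import groupby
from operator import itemgetter

lines = ["big data", "big compute"]

# Map phase: emit (key, 1) pairs for every word.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: sort and group by key (what the framework does between phases).
mapped.sort(key=itemgetter(0))
shuffled = {k: [v for _, v in grp] for k, grp in groupby(mapped, key=itemgetter(0))}

# Reduce phase: sum the values for each key.
reduced = {k: sum(vs) for k, vs in shuffled.items()}
print(reduced)  # {'big': 2, 'compute': 1, 'data': 1}
```

The higher-level tools listed above (Pig, Hive, Scalding, Crunch, …) exist precisely so that users can express this pipeline in one or two declarative lines instead of wiring the phases by hand.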
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
23
[Diagram: four generations of compute models]
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real time
• 4th generation (Flink): batch, interactive, real-time, iterative
1 Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch; scalability; abstractions (see the slide on the evolution of programming APIs); User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.
24
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs Tez vs Spark

Criteria           Hadoop MapReduce           Tez                       Spark
License            Open Source Apache 2.0,    Open Source Apache 2.0,   Open Source Apache 2.0,
                   version 2.x                version 0.x               version 1.x
Processing model   On-disk (disk-based        On-disk; batch,           In-memory and on-disk;
                   parallelization); batch    interactive               batch, interactive,
                                                                        streaming (near real-time)
Written in         Java                       Java                      Scala
API                [Java, Python, Scala],     Java, for ISV/engine/     [Scala, Java, Python],
                   user-facing                tool builders             user-facing
Libraries          None; separate tools       None                      Spark Core, Spark Streaming,
                                                                        Spark SQL, MLlib, GraphX
28
Hadoop MapReduce vs Tez vs Spark

Criteria           Hadoop MapReduce           Tez                       Spark
Installation       Bound to Hadoop            Bound to Hadoop           Isn't bound to Hadoop
Ease of use        Difficult to program;      Difficult to program;     Easy to program; no need
                   needs abstractions; no     no interactive mode       of abstractions;
                   interactive mode except    except Hive, Pig          interactive mode
                   Hive, Pig
Compatibility      Same for data types and    Same for data types       Same for data types
                   data sources               and data sources          and data sources
YARN integration   YARN application           Ground-up YARN            Spark is moving
                                              application               towards YARN
29
Hadoop MapReduce vs Tez vs Spark

Criteria           Hadoop MapReduce           Tez                       Spark
Deployment         YARN                       YARN                      [Standalone, YARN,
                                                                        SIMR, Mesos, …]
Performance        -                          -                         Good performance when data
                                                                        fits into memory; performance
                                                                        degradation otherwise
Security           More features and          More features and         Still in its infancy;
                   projects                   projects                  partial support
30
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala
2. You can translate your code from MapReduce to Apache Spark: "How-to: Translate from MapReduce to Apache Spark" http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
32
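To make the "reuse your mapper and reducer functions" point concrete, here is a minimal sketch in plain Python (no Spark required; this is deliberately not Spark's API). The same word count is written twice: once MapReduce-style with explicit map, shuffle and reduce phases, and once Spark-style as chained transformations mirroring `flatMap` / `map` / `reduceByKey` — note the mapper function is reused unchanged.

```python
# Conceptual sketch (plain Python) of translating MapReduce code to Spark-style
# chained transformations. The mapper is shared between both versions.
from collections import defaultdict

lines = ["spark and hadoop", "spark with hadoop"]

def mapper(line):
    # Classic MapReduce mapper: emit (word, 1) pairs.
    for word in line.split():
        yield (word, 1)

def mr_word_count(lines):
    # MapReduce style: map phase, shuffle (group by key), reduce phase.
    groups = defaultdict(list)
    for line in lines:                      # map
        for key, value in mapper(line):
            groups[key].append(value)       # shuffle: group by key
    return {k: sum(vs) for k, vs in groups.items()}  # reduce

def spark_style_word_count(lines):
    # Spark style: the same mapper reused in a transformation chain,
    # mirroring textFile(...).flatMap(...).reduceByKey(add).
    pairs = (kv for line in lines for kv in mapper(line))  # flatMap
    counts = {}
    for key, value in pairs:                # reduceByKey
        counts[key] = counts.get(key, 0) + value
    return counts

print(mr_word_count(lines) == spark_style_word_count(lines))  # True
```

Both versions produce `{"spark": 2, "and": 1, "hadoop": 2, "with": 1}`; in a real migration the surrounding driver code changes, but the per-record logic often carries over as-is.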
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (Currently in Beta;
Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (Currently in Beta;
Expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark
(Expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
  Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014 – Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
[Diagram: Hadoop ecosystem layers integrated with Spark, each mapping a service to an open source tool: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files
48
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup (Part 1)
• http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example (Part 2)
• http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways (Part 3)
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator)
• Integration is still improving; see the open SPARK JIRA issues with "yarn" in the summary: https://issues.apache.org/jira
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
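To illustrate what "automatically infer the schema" means, here is a minimal sketch in plain Python (stdlib `json` only; this is not Spark SQL's implementation, just the idea): scan JSON records and derive a field-to-type mapping, the way Spark SQL builds a schema from a JSON dataset without any DDL.

```python
# Conceptual sketch: infer a flat schema (field -> type name) from JSON lines,
# in the spirit of Spark SQL loading JSON as a SchemaRDD/DataFrame.
import json

records = [
    '{"name": "Alice", "age": 34}',
    '{"name": "Bob", "age": 27, "city": "LA"}',
]

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            # First type seen for each field wins in this simplified sketch;
            # fields missing from some records are simply optional.
            schema.setdefault(field, type(value).__name__)
    return schema

print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Real schema inference also resolves type conflicts and nested structures, but the core move — one pass over the data instead of a handwritten DDL — is the same.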
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4 Complementarity +
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
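The "Data << RAM" point can be made concrete with a tiny plain-Python sketch (no Spark involved): when the working set fits in memory, parsing once and reusing the in-memory result across passes saves repeated work, which is the intuition behind caching an RDD before iterative processing.

```python
# Conceptual sketch: repeated passes over raw data, with and without an
# in-memory cache of the parsed records (analogous to rdd.cache()).
raw = ["1,2", "3,4", "5,6"]
parse_calls = 0

def parse(line):
    global parse_calls
    parse_calls += 1
    return [int(x) for x in line.split(",")]

# Without caching: every pass re-parses the raw data.
total = sum(sum(parse(l)) for l in raw)   # pass 1
peaks = sum(max(parse(l)) for l in raw)   # pass 2
assert parse_calls == 6                   # 3 lines parsed twice

# With "caching": parse once, keep the records in memory, reuse them.
parse_calls = 0
cached = [parse(l) for l in raw]          # the cache step
total = sum(sum(rec) for rec in cached)   # pass 1
peaks = sum(max(rec) for rec in cached)   # pass 2
assert parse_calls == 3                   # 3 lines parsed once
```

The trade-off flips when data greatly exceeds memory: the "cache" no longer fits, and a more stream-oriented engine can win, which is exactly the Data >> RAM case above.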
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists. More integration is on the way
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
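One reason Spark can be file-system agnostic is that storage paths carry their backend in the URI scheme, so the same job can target HDFS, S3, Tachyon or the local file system just by changing the path string. The sketch below illustrates that dispatch idea in plain Python; the mapping and function names are illustrative, not Spark's actual path resolver.

```python
# Conceptual sketch: resolve a storage backend from a path's URI scheme,
# the way "hdfs://...", "s3n://...", "tachyon://..." or plain local paths
# select different storage systems behind one API.
from urllib.parse import urlparse

BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "tachyon": "Tachyon in-memory file system",
    "file": "local file system",
}

def backend_for(path):
    # A path with no scheme is treated as a local file, as a convention here.
    scheme = urlparse(path).scheme or "file"
    return BACKENDS.get(scheme, "unknown")

print(backend_for("hdfs://namenode:8020/data"))  # Hadoop Distributed File System
print(backend_for("s3n://bucket/logs"))          # Amazon S3
print(backend_for("/tmp/local.txt"))             # local file system
```

Swapping storage then becomes a configuration change rather than a code change, which is the practical meaning of "Bring Your Own Storage".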
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives

           | Hadoop Ecosystem | Spark Ecosystem
Components | HDFS             | Tachyon
           | YARN             | Mesos
Tools      | Pig              | Spark native API
           | Hive             | Spark SQL
           | Mahout           | MLlib
           | Storm            | Spark Streaming
           | Giraph           | GraphX
           | HUE              | Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos

Criteria         | YARN                     | Mesos
Resource sharing | Yes                      | Yes
Written in       | Java                     | C++
Scheduling       | Memory only              | CPU and Memory
Running tasks    | Unix processes           | Linux Container groups
Requests         | Specific requests and    | More generic, but more coding
                 | locality preference      | for writing frameworks
Maturity         | Less mature              | Relatively more mature
90
Spark Native API
• Spark Native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
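The "mix and match SQL and more imperative programming APIs" idea can be sketched with Python's built-in sqlite3 as a stand-in engine (this is an analogy, not Spark SQL's API): a declarative SQL step does the filtering and ordering, then ordinary code takes over where SQL gets awkward. The table and values are invented for illustration.

```python
# Conceptual sketch: mixing declarative SQL with imperative post-processing
# in one program, in the spirit of Spark SQL (sqlite3 used as a stand-in).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (page TEXT, visits INTEGER)")
conn.executemany("INSERT INTO hits VALUES (?, ?)",
                 [("home", 120), ("docs", 45), ("blog", 80)])

# Declarative step: SQL selects, filters and orders.
rows = conn.execute(
    "SELECT page, visits FROM hits WHERE visits > 50 ORDER BY visits DESC"
).fetchall()

# Imperative step: arbitrary code over the query result.
report = [f"{page}: {'#' * (visits // 20)}" for page, visits in rows]
print(report)  # ['home: ######', 'blog: ####']
```

In Spark SQL the same pattern applies at cluster scale: a SQL query produces a SchemaRDD/DataFrame, and RDD transformations or MLlib calls continue from there in the same program.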
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming

Criteria                | Storm                    | Spark Streaming
Processing Model        | Record at a time         | Mini batches
Latency                 | Sub-second               | Few seconds
Fault tolerance –       | At least once (may be    | Exactly once
every record processed  | duplicates)              |
Batch framework         | Not available            | Core Spark API
integration             |                          |
Supported languages     | Any programming language | Scala, Java, Python
95
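The first row of the comparison — record-at-a-time versus mini-batches — can be illustrated with a small plain-Python sketch (purely conceptual; neither Storm's nor Spark Streaming's actual API):

```python
# Conceptual sketch of the two streaming models compared above.
def record_at_a_time(stream, handle):
    # Storm-like: each record is handled as soon as it arrives (lowest latency).
    return [handle(record) for record in stream]

def mini_batches(stream, batch_size):
    # Spark Streaming-like: records are buffered into small batches, and each
    # batch is processed as one unit (a few seconds of latency, but the batch
    # APIs of Core Spark can be reused unchanged).
    batch, out = [], []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            out.append(batch)
            batch = []
    if batch:
        out.append(batch)  # flush the final partial batch
    return out

stream = [1, 2, 3, 4, 5]
print(record_at_a_time(stream, lambda r: r * 10))  # [10, 20, 30, 40, 50]
print(mini_batches(stream, 2))                     # [[1, 2], [3, 4], [5]]
```

Spark Streaming batches by time interval rather than by count, but the structural point is the same: each mini-batch is just a small batch job, which is why "exactly once" semantics and Core Spark integration come naturally.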
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive, web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
2. Typical Big Data Stack
17
3. Apache Hadoop
• Apache Hadoop as an example of a Typical Big Data Stack.
• Hadoop ecosystem = Hadoop Stack + many other tools (either open source and free, or commercial ones).
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• "Hadoop's Impact on Data Management's Future" - Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4. Apache Spark
• Apache Spark as an example of a Typical Big Data Stack.
• Apache Spark provides you Big Data computing and more:
• BYOS: Bring Your Own Storage!
• BYOC: Bring Your Own Cluster!
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!
19
5. Key Takeaways
1. Big Data: Still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: Emergence of the Apache Spark ecosystem.
20
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
21
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi: a Scala productivity framework for Hadoop. https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding: a Scala API for Cascading. http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
23
• 1st generation (MapReduce): Batch
• 2nd generation (Tez): Batch, Interactive
• 3rd generation (Spark): Batch, Interactive, Near-Real-Time
• 4th generation (Flink): Batch, Interactive, Real-Time, Iterative
1. Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see slide on the evolution of programming APIs), User-Defined Functions (UDFs)...
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning, and graph analytics.
24
1. Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
26
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark

Criteria            | Hadoop MapReduce                             | Tez                                   | Spark
License             | Open source, Apache 2.0, version 2.x         | Open source, Apache 2.0, version 0.x  | Open source, Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); batch  | On-disk; batch, interactive           | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                                         | Java                                  | Scala
API                 | [Java, Python, Scala], user-facing           | [Java], for ISV/engine/tool builders  | [Scala, Java, Python], user-facing
Libraries           | None, separate tools                         | None                                  | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria         | Hadoop MapReduce                                                               | Tez                                                        | Spark
Installation     | Bound to Hadoop                                                                | Bound to Hadoop                                            | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility    | Same for data types and data sources                                           | Same for data types and data sources                       | Same for data types and data sources
YARN integration | YARN application                                                               | Ground-up YARN application                                 | Spark is moving towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, ...]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy (partial support)
30
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
31
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
32
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and Mostapha Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
41
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
42
3. Integration: Service → Open Source Tool
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
44
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
46
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883). https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem + Spark ecosystem
4. Complementarity: HDFS + Tachyon + Spark
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity: YARN + Mesos
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or... HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: A healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a 'Non-HDFS' file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
75
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
Stratio
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Hadoop ecosystem → Spark ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
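Because Tachyon exposes a Hadoop-compatible file system API, "without any code change" in practice means only the URI scheme changes. A minimal sketch in plain Python (the host and paths are hypothetical, and the Spark call is shown only in a comment):

```python
# Hedged sketch: moving an existing Spark/MapReduce job from HDFS to
# Tachyon typically amounts to swapping the URI scheme; the job code
# and the Hadoop FileSystem API calls stay the same.
def to_tachyon(path: str) -> str:
    """Rewrite an hdfs:// URI to the equivalent tachyon:// URI."""
    if path.startswith("hdfs://"):
        return "tachyon://" + path[len("hdfs://"):]
    return path

# e.g. in a Spark job: sc.textFile(to_tachyon("hdfs://namenode:9000/logs"))
print(to_tachyon("hdfs://namenode:9000/logs"))  # tachyon://namenode:9000/logs
```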
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as a datacenter "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding to write frameworks
Maturity         | Less mature                               | Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming

Criteria                       | Storm                             | Spark Streaming
Processing model               | Record at a time                  | Mini batches
Latency                        | Sub-second                        | Few seconds
Fault tolerance (every record) | At least once (may be duplicates) | Exactly once
Batch framework integration    | Not available                     | Core Spark API
Supported languages            | Any programming language          | Scala, Java, Python
95
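The processing-model row is the key difference: Storm hands each record to your code as it arrives, while Spark Streaming cuts the stream into small batches, each processed as an ordinary Spark job. A toy sketch of the two control flows in plain Python (no Storm or Spark APIs involved):

```python
# Hedged sketch: record-at-a-time vs mini-batch processing, illustrated
# with a plain Python list standing in for an incoming stream.
def record_at_a_time(stream, handle):
    # Storm-style: invoke the handler once per record (lowest latency).
    for record in stream:
        handle(record)

def mini_batches(stream, batch_size):
    # Spark Streaming-style: group the stream into small batches; each
    # batch is then processed by the regular batch engine.
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

print(list(mini_batches([1, 2, 3, 4, 5], 2)))  # [[1, 2], [3, 4], [5]]
```

This is also why the table says "Core Spark API" under batch framework integration: a mini batch is just a small batch, so the same code path serves both.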
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython. https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring Your Own Storage!
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Apache Hadoop
• Apache Hadoop as an example of a typical Big Data stack
• Hadoop ecosystem = Hadoop stack + many other tools (either open source and free, or commercial ones)
• Big Data Ecosystem Dataset: http://bigdata.andreamostosi.name. An incomplete but useful list of Big Data related projects, packed into a JSON dataset.
• Hadoop's Impact on Data Management's Future, Amr Awadallah (Strata + Hadoop 2015), February 19, 2015. Watch the video at 2:36 on 'Hadoop Isn't Just Hadoop Anymore' for a picture representing the evolution of Apache Hadoop. https://www.youtube.com/watch?v=1KvTZZAkHy0
18
4 Apache Spark
• Apache Spark as an example of a typical Big Data stack
• Apache Spark provides you Big Data computing and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core: http://sparkbigdata.com/component/tags/tag/11-core-spark
• Spark Streaming: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
• Spark SQL: http://sparkbigdata.com/component/tags/tag/4-spark-sql
• MLlib (Machine Learning): http://sparkbigdata.com/component/tags/tag/5-mllib
• GraphX: http://sparkbigdata.com/component/tags/tag/6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and commercial vendors. I'm compiling a list. Stay tuned!
19
5 Key Takeaways
1 Big Data: still one of the most inflated buzzwords.
2 Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3 Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4 Apache Spark: emergence of the Apache Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark, and Flink.
23
(Slide: evolution of compute models across four generations)
• 1st generation (MapReduce): batch
• 2nd generation (e.g. Tez): batch, interactive
• 3rd generation (e.g. Spark): batch, interactive, near-real-time
• 4th generation (e.g. Flink): batch, interactive, real-time, iterative
1 Evolution
bull This is how Hadoop MapReduce is branding itself ldquoA YARN-based system for parallel processing of large data sets httphadoopapacheorg
bull Batch Scalability Abstractions ( See slide on evolution of Programming APIs) User Defined Functions (UDFs)hellip
bull Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job In practice most problems dont fit neatly into a single MR job
bull Need to integrate many disparate tools for advanced Big Data Analytics for Queries Streaming Analytics Machine Learning and Graph Analytics
24
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark brands itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the 'core capability' of Apache Spark.
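The 'core capability' claim rests on two ideas: transformations on an RDD are lazy (they only record lineage), and results are materialized by an action. A toy illustration of that evaluation model in plain Python (this is not the Spark API, just its shape):

```python
# Hedged sketch: a minimal stand-in for an RDD. Transformations (map,
# filter) only record lineage; the action (collect) runs the recorded
# operations over the data in one pass per operation.
class Lineage:
    def __init__(self, data, ops=()):
        self._data = data      # source "partition" (a plain list here)
        self._ops = ops        # recorded transformations, not yet run

    def map(self, f):
        return Lineage(self._data, self._ops + (("map", f),))

    def filter(self, p):
        return Lineage(self._data, self._ops + (("filter", p),))

    def collect(self):         # the "action" that triggers execution
        out = self._data
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

nums = Lineage([1, 2, 3, 4])
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.collect())  # [4, 16]
```

In real Spark the lineage additionally enables fault tolerance: a lost partition is recomputed from its recorded transformations rather than replicated.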
26
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs Tez vs Spark

Criteria         | Hadoop MapReduce                            | Tez                                  | Spark
License          | Open source, Apache 2.0; version 2.x        | Open source, Apache 2.0; version 0.x | Open source, Apache 2.0; version 1.x
Processing model | On-disk (disk-based parallelization); batch | On-disk; batch, interactive          | In-memory, on-disk; batch, interactive, streaming (near real-time)
Written in       | Java                                        | Java                                 | Scala
API              | [Java, Python, Scala], user-facing          | Java [ISV/engine/tool builder]       | [Scala, Java, Python], user-facing
Libraries        | None; separate tools                        | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs Tez vs Spark

Criteria         | Hadoop MapReduce                                                                 | Tez                                                         | Spark
Installation     | Bound to Hadoop                                                                  | Bound to Hadoop                                             | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode (except Hive, Pig) | Difficult to program; no interactive mode (except Hive, Pig)| Easy to program, no need for abstractions; interactive mode
Compatibility    | Compatibility to data types and data sources is the same for all three engines
YARN integration | YARN application                                                                 | Ground-up YARN application                                  | Spark is moving towards YARN
29
Hadoop MapReduce vs Tez vs Spark

Criteria    | Hadoop MapReduce           | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN, SIMR, Mesos, …]
Performance | -                          | -                          | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy (partial support)
30
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
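The Cloudera how-to above boils down to a mechanical mapping: a WordCount mapper becomes flatMap/map, and the reducer becomes reduceByKey. A plain-Python sketch of that translation (Python collections stand in for RDDs; the equivalent Spark calls appear in the comments):

```python
# Hedged sketch: classic WordCount expressed in the Spark model rather
# than as a MapReduce job. Plain Python stands in for the RDD API.
from collections import defaultdict

def word_count(lines):
    # MR map phase -> rdd.flatMap(lambda l: l.split()).map(lambda w: (w, 1))
    pairs = [(w, 1) for line in lines for w in line.split()]
    # MR shuffle + reduce phase -> .reduceByKey(lambda a, b: a + b)
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(word_count(["to be or not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Note how the mapper and reducer bodies survive the translation intact, which is exactly point 1 above: only the job-driver boilerplate disappears.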
32
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez migrate to Spark easily, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka 'from SQL to Hadoop') was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye, MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
(Slide: matrix of open source tools that integrate with Spark, by service: storage/serving layer, data formats, data ingestion services, resource management, search, SQL)
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark and Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark back to Cassandra. http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator)
• Integration is still improving; see the open Spark/YARN JIRA issues: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883, https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
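What "automatically infer the schema" means can be illustrated without Spark: scan the JSON records, union the field names, and record a type per field. A deliberately naive sketch in plain Python (Spark SQL's real inference also merges nested structures and reconciles conflicting types):

```python
# Hedged sketch: naive schema inference over newline-delimited JSON
# records, mimicking the idea behind Spark SQL's JSON support.
import json

def infer_schema(json_lines):
    """Union the fields seen across records, keeping the first-seen type."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

records = ['{"name": "Alice", "age": 34}', '{"name": "Bob", "city": "LA"}']
print(infer_schema(records))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

The payoff is the same as on the slide: no DDL up front, because the schema is derived from the data itself.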
56
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
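The benefit behind Parquet is its columnar layout: values of one column are stored contiguously, so a query touching two columns out of fifty reads only those two. A toy row-to-column transposition in plain Python (Parquet itself adds encodings, compression, and metadata on top of this idea):

```python
# Hedged sketch: row-oriented records transposed into the column-
# oriented layout that makes formats like Parquet efficient for scans.
def to_columnar(rows):
    """Transpose a list of identically-keyed row dicts into column lists."""
    columns = {key: [] for key in rows[0]}
    for row in rows:
        for key, value in row.items():
            columns[key].append(value)
    return columns

rows = [{"id": 1, "city": "LA"}, {"id": 2, "city": "SF"}]
print(to_columnar(rows))  # {'id': [1, 2], 'city': ['LA', 'SF']}
```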
57
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrates ingestion of HDFS data into Solr from MapReduce to Spark
• Updates and deletes existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity: HDFS + Tachyon + Spark
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity: YARN + Mesos
References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more 'stream oriented', has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
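The rule of thumb above can be written down directly: compare data volume to aggregate cluster memory and lean toward the engine that suits that ratio. A hedged sketch (the simple threshold and the returned names are illustrative only; real capacity planning involves far more than one ratio):

```python
# Hedged sketch of the slide's heuristic: Tez for data volumes well
# beyond cluster RAM, Spark when the working set fits in memory.
def suggest_engine(data_gb, cluster_ram_gb):
    ratio = data_gb / cluster_ram_gb
    if ratio >= 1.0:
        return "tez"    # Data >> RAM: stream-oriented, mature shuffle
    return "spark"      # Data << RAM: cache parsed data in memory

print(suggest_engine(data_gb=10_000, cluster_ram_gb=512))  # tez
print(suggest_engine(data_gb=200, cluster_ram_gb=512))     # spark
```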
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all!
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
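As an illustration, pointing a Spark job at S3 instead of HDFS is mostly a matter of the URI scheme plus credentials. A hedged sketch (bucket, key names and the application file are placeholders; Spark 1.x era, using the Hadoop s3n:// connector):

```shell
# Illustrative only: read logs from S3 rather than HDFS (no Hadoop cluster needed).
# Credentials may also be set in core-site.xml or via environment variables.
spark-submit \
  --conf spark.hadoop.fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY \
  --conf spark.hadoop.fs.s3n.awsSecretAccessKey=YOUR_SECRET_KEY \
  my_log_analyzer.py s3n://my-bucket/logs/
```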
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
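The deployment choices above differ only in the --master URL passed at submit time; the application code itself is unchanged. An illustrative sketch (host names and the app file are placeholders; Spark 1.x era syntax):

```shell
# Same application, different cluster managers:
spark-submit --master local[4]           my_app.py   # 1. local, 4 worker threads
spark-submit --master spark://host:7077  my_app.py   # 2. standalone cluster
spark-submit --master mesos://host:5050  my_app.py   # 3. Apache Mesos
spark-submit --master yarn-client        my_app.py   # Hadoop YARN (client mode)
```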
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready httpwwwstratiocom
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives

              Hadoop ecosystem    Spark ecosystem
 Component    HDFS                Tachyon
              YARN                Mesos
 Tools        Pig                 Spark native API
              Hive                Spark SQL
              Mahout              MLlib
              Storm               Spark Streaming
              Giraph              GraphX
              HUE                 Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce httptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
• Mesos as Data Center "OS"
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs. Mesos

 Criteria          YARN                  Mesos
 Resource sharing  Yes                   Yes
 Written in        Java                  C++
 Scheduling        Memory only           CPU and memory
 Running tasks     Unix processes        Linux container groups
 Requests          Specific requests     More generic, but more coding
                   and locality          for writing frameworks
                   preference
 Maturity          Less mature           Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, to get code nearly as concise as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs. Spark Streaming

 Criteria            Storm                 Spark Streaming
 Processing model    Record at a time      Mini batches
 Latency             Sub-second            Few seconds
 Fault tolerance     At least once         Exactly once
 (every record       (may be duplicates)
 processed)
 Batch framework     Not available         Core Spark API
 integration
 Supported           Any programming       Scala, Java,
 languages           language              Python
95
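The first row of the table is the key difference. A plain-Python toy (not Storm or Spark Streaming code) of what "mini batches" means: the stream is chopped into small groups, and each group is then processed as one fast batch job:

```python
def mini_batches(events, batch_size):
    # Spark Streaming-style: cut the incoming stream into small batches;
    # each batch is then handled by the ordinary (fast) batch engine
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]

def record_at_a_time(events, handle):
    # Storm-style: hand every record to the topology as soon as it arrives
    return [handle(event) for event in events]

stream = list(range(10))
batches = mini_batches(stream, 3)
print(len(batches))  # 4 batches: [0,1,2], [3,4,5], [6,7,8], [9]
```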
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring Your Own Storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
4. Apache Spark
• Apache Spark as an example of a typical Big Data stack
• Apache Spark provides you Big Data computing, and more:
• BYOS: Bring Your Own Storage
• BYOC: Bring Your Own Cluster
• Spark Core httpsparkbigdatacomcomponenttagstag11-core-spark
• Spark Streaming httpsparkbigdatacomcomponenttagstag3-spark-streaming
• Spark SQL httpsparkbigdatacomcomponenttagstag4-spark-sql
• MLlib (Machine Learning) httpsparkbigdatacomcomponenttagstag5-mllib
• GraphX httpsparkbigdatacomcomponenttagstag6-graphx
• The Spark ecosystem is emerging fast, with roots in BDAS, the Berkeley Data Analytics Stack, and new tools from both the open source community and the commercial one. I'm compiling a list. Stay tuned!
19
5. Key Takeaways
1. Big Data: still one of the most inflated buzzwords
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer synonymous with Big Data
4. Apache Spark: emergence of the Apache Spark ecosystem
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
bull MapReduce in Java is like assembly code of Big
Data httpwikiapacheorghadoopWordCount
bull Pig httppigapacheorg
bull Hive httphiveapacheorg
• Scoobi: a Scala productivity framework for Hadoop httpsgithubcomNICTAscoobi
bull Cascading httpwwwcascadingorg
bull Scalding A Scala API for Cascading httptwittercomscalding
bull Crunch httpcrunchapacheorg
bull Scrunch httpcrunchapacheorgscrunchhtml
22
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink.
23
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets" httphadoopapacheorg
• Batch, scalability, abstractions (see slide on the evolution of programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• You need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics
24
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."
Source: httptezapacheorg
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop
25
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing" httpssparkapacheorg
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark
26
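A plain-Python intuition (not Spark code) for why in-memory RDD reuse is the "core capability" for iterative and interactive workloads: without caching, every job re-parses the input, while an rdd.cache()-style approach keeps the parsed result in memory for reuse.

```python
parse_calls = {"count": 0}

def expensive_parse(line):
    # Stand-in for costly I/O plus deserialization
    parse_calls["count"] += 1
    return len(line)

data = ["a", "bb", "ccc"]

# MapReduce-style: each job re-reads and re-parses the input from disk
job_a = [expensive_parse(line) for line in data]
job_b = [expensive_parse(line) for line in data]
print(parse_calls["count"])  # 6: parsed twice

# RDD.cache()-style: parse once, keep the result in memory, reuse it
parse_calls["count"] = 0
cached = [expensive_parse(line) for line in data]  # first action materializes
job_a = sum(cached)
job_b = max(cached)
print(parse_calls["count"])  # 3: parsed only once
```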
1 Evolution Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
bull Apache Flink httpflinkapacheorg offers
bull Batch and Streaming in the same system
bull Beyond DAGs (Cyclic operator graphs)
bull Powerful expressive APIs
bull Inside-the-system iterations
bull Full Hadoop compatibility
bull Automatic language independent optimizer
bull lsquoFlinkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag27-flink
27
Hadoop MapReduce vs. Tez vs. Spark

 Criteria     MapReduce             Tez                  Spark
 License      Open Source           Open Source          Open Source
              Apache 2.0,           Apache 2.0,          Apache 2.0,
              version 2.x           version 0.x          version 1.x
 Processing   On-disk (disk-based   On-disk; batch,      In-memory and on-disk;
 model        parallelization);     interactive          batch, interactive,
              batch                                      streaming (near real-time)
 Written in   Java                  Java                 Scala
 API          [Java, Python,        Java [ISV/engine/    [Scala, Java, Python]
              Scala], user-facing   tool builder]        user-facing
 Libraries    None; separate tools  None                 [Spark Core, Spark Streaming,
                                                         Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark

 Criteria       MapReduce              Tez                    Spark
 Installation   Bound to Hadoop        Bound to Hadoop        Isn't bound to Hadoop
 Ease of use    Difficult to program,  Difficult to program;  Easy to program, no
                needs abstractions;    no interactive mode    need for abstractions;
                no interactive mode    except Hive, Pig       interactive mode
                except Hive, Pig
 Compatibility  Same data types and    Same data types and    Same data types and
                data sources           data sources           data sources
 YARN           YARN application       Ground-up YARN         Spark is moving
 integration                           application            towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark

 Criteria     MapReduce           Tez                 Spark
 Deployment   YARN                YARN                [Standalone, YARN,
                                                      SIMR, Mesos, …]
 Performance                                          Good performance when data
                                                      fits into memory; performance
                                                      degradation otherwise
 Security     More features and   More features and   Still in its infancy
              projects            projects
30
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
bull Existing Hadoop MapReduce projects can
migrate to Spark and leverage Spark Core as
execution engine
1 You can often reuse your mapper and
reducer functions and just call them in
Spark from Java or Scala
2 You can translate your code from
MapReduce to Apache Spark How-to
Translate from MapReduce to Apache Sparkhttpblogclouderacomblog201409how-to-translate-from-mapreduce-to-
apache-spark
32
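The conceptual mapping is small: a MapReduce mapper that emits (word, 1) pairs becomes a flatMap/map, and the shuffle-plus-reducer becomes reduceByKey. A plain-Python sketch of that translation for word count (not actual Spark code; in Spark itself this is roughly lines.flatMap(...).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)):

```python
from collections import defaultdict
from functools import reduce

def flat_map(f, data):
    # Spark's flatMap: apply f to each element, then flatten the results
    return [y for x in data for y in f(x)]

def reduce_by_key(f, pairs):
    # Spark's reduceByKey: group values by key (the MapReduce "shuffle"),
    # then merge each group with f (the "reduce" phase)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(f, values) for key, values in groups.items()}

lines = ["spark with hadoop", "spark without hadoop"]
words = flat_map(str.split, lines)                  # mapper: split lines
pairs = [(word, 1) for word in words]               # mapper: emit (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)   # reducer: sum the 1s
print(counts["spark"])  # 2
```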
2 Transition
3 The following tools originally based on Hadoop
MapReduce are being ported to Apache Spark
bull Pig Hive Sqoop Cascading Crunch Mahout hellip
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open) httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag19
34
Hive on Spark (currently in beta,
expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
bull Help existing Hive applications running on
MapReduce or Tez easily migrate to Spark without
development effort
bull Exposes Spark users to a viable feature-rich de facto
standard SQL tool on Hadoop
bull Performance benefits especially for Hive queries
involving multiple reducer stages
bull Hive on Spark Umbrella Jira (Status Open) Q1 2015httpsissuesapacheorgjirabrowseHIVE-7292
35
Hive on Spark (currently in beta,
expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-
motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Start
ed
bull Hive on Spark February 11 2015 Szehon Ho
Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
bull Hive on spark is blazing fast or is it Carter Shanklin and
Mostapah Mokhtar (Hortonworks) February 20 2015httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
36
Sqoop on Spark
(Expected in Sqoop 2)
bull Sqoop ( aka from SQL to Hadoop) was initially
developed as a tool to transfer data from RDBMS to
Hadoop
bull The next version of Sqoop referred to as Sqoop2
supports data transfer across any two data sources
bull Sqoop 2 Proposal is still under
discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Pro
posal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira
Status Work In Progress) The goal of this ticket is to support a
pluggable way to select the execution engine on which we can run
the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
37
(Expected in 31 release)
bull Cascading httpwwwcascadingorg is an application
development platform for building data applications on
Hadoop
bull Support for Apache Spark is on the roadmap and will be
available in Cascading 31 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the
transition from CascadingScalding to Spark a little
easier by adding support for Cascading Taps Scalding
Sources and the Scalding Fields API in Spark Sourcehttpscaldingio201410running-scalding-on-apache-spark
38
Apache Crunch
bull The Apache Crunch Java library provides a
framework for writing testing and running
MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 011 releases with a
SparkPipeline class making it easy to migrate
data processing applications from MapReduce
to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSpark
Pipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-
xtopicscdh_ig_running_crunch_with_sparkhtml
39
(Expected in Mahout 1.0)
• Mahout news, 25 April 2014 - Goodbye MapReduce! Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations httpmahoutapacheorg
bull Integration of Mahout and Spark
bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this
DSL are automatically optimized and executed in
parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for
Spark optimized Mahout DSLhttpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
40
(Expected in Mahout 1.0)
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov
April 2014
httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with
Mahout Scala and Spark Published on May 30 2014
httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-
with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)-
MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3. Integration
(Logo table: open source tools by service category - storage/serving layer, data formats, data ingestion services, resource management, search, SQL)
43
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM, Discardable Distributed Memory httphortonworkscomblogddm, to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
44
3 Integration
bull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via
newAPIHadoopRDD Example HBaseTestscala from
Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapach
esparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available
for reading from and writing to HBase without the need
of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with
Spark Status Still in experimentation and no timetable for
possible support httpblogclouderacomblog201412new-in-cloudera-
labs-sparkonhbase
45
3 Integration
bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark
RDDs to Cassandra tables and execute arbitrary CQL
queries in your Spark applications Supports also
integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag20-cassandra
46
3 Integration
bull Benchmark of Spark amp Cassandra Integration
using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume
data from Cassandra to spark and store Resilient
Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new
avenues
bull Kindling An Introduction to Spark with Cassandra
(Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-
spark-with-cassandra
47
3 Integration
bull MongoDB is not directly served by Spark although
it can be used from Spark via an official Mongo-
Hadoop connector
bull MongoDB-Spark Demohttpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-
insights
bull Spark SQL also provides indirect support via its
support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
48
3 Integration
bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from
Apache Spark (still experimental)
bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-
introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-
example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-
example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without
Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
49
3 Integration
bull Neo4j is a highly scalable robust (fully ACID) native graph
database
bull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015
httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015
httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph
Analytics By Kenny Bastani November 3 2014
httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
50
3 Integration YARN
bull YARN Yet Another Resource Negotiator Implicit
reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND
20summary20~20yarn20AND20status203D20OPEN20ORDER20
BY20priority20DESC0A
bull Some issues are critical ones
bull Running Spark on YARNhttpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
3 Integration
bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to
address new use cases
bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query
in-memory data in Spark Embed Drill execution in a
Spark data pipeline
Source Whats Coming in 2015 for
Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
53
3 Integration
bull Apache Kafka is a high throughput distributed
messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka
Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming
Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-
example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka
54
3 Integration
bull Apache Flume is a streaming event data
ingestion system that is designed for Big Data
ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with
Flume There are two approaches to this
bull Approach 1 Flume-style Push-based Approach
bull Approach 2 (Experimental) Pull-based
Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
55
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at your JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-
support-in-spark-sqlhtml
56
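As a toy illustration of what "inferring the schema" means (plain Python, not Spark SQL's actual implementation): scan the JSON records and merge every observed field into one relation, so fields missing from some records simply become nullable columns.

```python
import json

def infer_schema(json_lines):
    # Merge the fields seen across all records into a single schema,
    # remembering the Python type name of the first value observed
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

records = [
    '{"name": "alice", "age": 30}',
    '{"name": "bob", "city": "Los Angeles"}',  # no "age": nullable column
]
schema = infer_schema(records)
print(sorted(schema))  # ['age', 'city', 'name']
```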
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrative example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
57
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
bull Problem
bull Various inbound data sets
bull Data Layout can change without notice
bull New data sets can be added without notice
Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
bull The Kite SDK provides high level abstractions to
work with datasets on Hadoop hiding many of
the details of compression codecs file formats
partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016
release so Spark jobs can read and write to Kite
datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1. Evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
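For instance, reading straight from Amazon S3 needs no HDFS at all (a hypothetical spark-shell session; the bucket name and credential placeholders are made up):

```scala
// Hypothetical spark-shell session; bucket and credentials are made up
scala> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID")
scala> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET")
scala> val logs = sc.textFile("s3n://my-bucket/logs/*.log") // no HDFS involved
scala> logs.count()
```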
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
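The choice of cluster manager boils down to the `--master` flag of spark-submit (an illustrative config fragment; the application jar, class name, and host names are made up):

```shell
# Illustrative spark-submit invocations; app.jar, my.App and host names are made up
./bin/spark-submit --master local[4]          --class my.App app.jar   # local mode, 4 threads
./bin/spark-submit --master spark://host:7077 --class my.App app.jar   # standalone cluster
./bin/spark-submit --master mesos://host:5050 --class my.App app.jar   # Apache Mesos
./bin/spark-submit --master yarn-cluster      --class my.App app.jar   # Hadoop YARN
```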
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
              Hadoop ecosystem    Spark ecosystem
Components:   HDFS                Tachyon
              YARN                Mesos
Tools:        Pig                 Spark native API
              Hive                Spark SQL
              Mahout              MLlib
              Storm               Spark Streaming
              Giraph              GraphX
              HUE                 Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
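A sketch of how Spark can lean on Tachyon (a hypothetical spark-shell session on a Spark 1.x cluster with Tachyon configured; the master host and paths are made up):

```scala
// Hypothetical spark-shell session; Tachyon master address and paths are made up
scala> import org.apache.spark.storage.StorageLevel
scala> val events = sc.textFile("tachyon://master:19998/data/events") // read through Tachyon
scala> events.persist(StorageLevel.OFF_HEAP) // cache RDD blocks in Tachyon, off the JVM heap
scala> events.saveAsTextFile("tachyon://master:19998/data/events-copy")
```

Because the cached blocks live in Tachyon rather than in the executor JVMs, other frameworks (or a restarted Spark job) can reuse them.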
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria           YARN                              Mesos
Resource sharing   Yes                               Yes
Written in         Java                              C++
Scheduling         Memory only                       CPU and memory
Running tasks      Unix processes                    Linux container groups
Requests           Specific requests and             More generic, but more coding
                   locality preference               for writing frameworks
Maturity           Less mature                       Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as with the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                     Storm                      Spark Streaming
Processing model             Record at a time           Mini-batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once (may be      Exactly once
record processed)            duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python
95
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage!
2. Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
5 Key Takeaways
1. Big Data: still one of the most inflated buzzwords.
2. Typical Big Data Stack: Big Data stacks look similar on paper. Aren't they?
3. Apache Hadoop: Hadoop is no longer 'synonymous' with Big Data.
4. Apache Spark: emergence of the Apache Spark ecosystem.
20
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
21
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice of compute model (execution engine) on Hadoop. Now, in addition to MapReduce v2, we have Tez, Spark and Flink.
23
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real time
• 4th generation (Flink): batch, interactive, real-time, iterative
1 Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning and graph analytics.
1 Evolution
• Tez: Hindi for "speed".
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
1 Evolution Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark
Criteria          MapReduce                 Tez                      Spark
License           Open Source Apache 2.0,   Open Source Apache 2.0,  Open Source Apache 2.0,
                  version 2.x               version 0.x              version 1.x
Processing model  On-disk (disk-based       On-disk; batch,          In-memory and on-disk;
                  parallelization); batch   interactive              batch, interactive, streaming
                                                                     (near real-time)
Written in        Java                      Java                     Scala
API               [Java, Python, Scala],    Java, [ISV/engine/tool   [Scala, Java, Python],
                  user-facing               builder]                 user-facing
Libraries         None, separate tools      None                     [Spark Core, Spark Streaming,
                                                                     Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark
Criteria          MapReduce                    Tez                       Spark
Installation      Bound to Hadoop              Bound to Hadoop           Isn't bound to Hadoop
Ease of use       Difficult to program, needs  Difficult to program;     Easy to program, no need
                  abstractions; no interactive no interactive mode       for abstractions;
                  mode (except Hive, Pig)      (except Hive, Pig)        interactive mode
Compatibility     Same data types and          Same                      Same
                  data sources
YARN integration  YARN application             Ground-up YARN            Spark is moving
                                               application               towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria      MapReduce                    Tez                          Spark
Deployment    YARN                         YARN                         [Standalone, YARN, SIMR, Mesos, …]
Performance   -                            -                            Good performance when data fits into
                                                                        memory; performance degradation otherwise
Security      More features and projects   More features and projects   Still in its infancy; partial support
30
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as their execution engine:
1. You can often reuse your mapper and reducer functions, and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
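For instance, the classic MapReduce word count collapses to a few lines once the mapper and reducer logic is re-expressed with Spark's API (a sketch, shown as a hypothetical spark-shell session; the input and output paths are made up):

```scala
// The mapper (tokenize, emit (word, 1)) and the reducer (sum counts) become inline functions
scala> val counts = sc.textFile("hdfs:///input")
     |   .flatMap(line => line.split(" "))   // mapper logic
     |   .map(word => (word, 1))
     |   .reduceByKey(_ + _)                 // reducer logic
scala> counts.saveAsTextFile("hdfs:///output")
```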
32
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka 'from SQL to Hadoop') was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
Hadoop ecosystem services that integrate with Spark (the slide shows the open source tools for each): storage/serving layer, data formats, data ingestion services, resource management, search, SQL.
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
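The newAPIHadoopRDD route looks roughly like this (a sketch modeled on Spark's own HBaseTest.scala example; the table name is made up, and the HBase client jars must be on the classpath):

```scala
// Hypothetical spark-shell session; "my_table" is a made-up HBase table name
scala> import org.apache.hadoop.hbase.HBaseConfiguration
scala> import org.apache.hadoop.hbase.mapreduce.TableInputFormat
scala> val conf = HBaseConfiguration.create()
scala> conf.set(TableInputFormat.INPUT_TABLE, "my_table")
scala> val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
     |   classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
     |   classOf[org.apache.hadoop.hbase.client.Result])   // (rowkey, row) pairs
scala> rdd.count()
```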
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
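With the Spark Cassandra Connector on the classpath, both directions are one-liners (a sketch; the keyspace, table and column names are made up):

```scala
// Hypothetical spark-shell session with the Spark Cassandra Connector loaded;
// keyspace, table and column names are made up
scala> import com.datastax.spark.connector._
scala> val users = sc.cassandraTable("my_keyspace", "users") // Cassandra table as an RDD
scala> users.count()
scala> sc.parallelize(Seq(("bob", 42)))
     |   .saveToCassandra("my_keyspace", "users", SomeColumns("name", "age")) // RDD back to Cassandra
```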
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB, without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving, and some open issues are critical ones (see the open Spark JIRA issues mentioning YARN).
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
  • Use Drill to query Spark RDDs; use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches:
  • Approach 1: Flume-style Push-based Approach
  • Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at your JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
3 Integration
bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows to
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
57
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
(Diagram: Hadoop ecosystem and Spark ecosystem)
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
75
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
86
4. Alternatives

           Hadoop Ecosystem | Spark Ecosystem
Components:
           HDFS             | Tachyon
           YARN             | Mesos
Tools:
           Pig              | Spark native API
           Hive             | Spark SQL
           Mahout           | MLlib
           Storm            | Spark Streaming
           Giraph           | GraphX
           HUE              | Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and Memory
Running tasks    | Unix processes                            | Linux Container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
95
GraphX
96
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
21
1. Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big Data: http://wiki.apache.org/hadoop/WordCount
• Pig: http://pig.apache.org
• Hive: http://hive.apache.org
• Scoobi, a Scala productivity framework for Hadoop: https://github.com/NICTA/scoobi
• Cascading: http://www.cascading.org
• Scalding, a Scala API for Cascading: http://twitter.com/scalding
• Crunch: http://crunch.apache.org
• Scrunch: http://crunch.apache.org/scrunch.html
22
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark, and Flink.
23
1st generation (MapReduce): Batch
2nd generation (Tez): Batch, Interactive
3rd generation (Spark): Batch, Interactive, Near-Real-Time
4th generation (Flink): Batch, Interactive, Real-Time, Iterative
1. Evolution
• This is how Hadoop MapReduce brands itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, scalability, abstractions (see the slide on the evolution of programming APIs), User Defined Functions (UDFs), …
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data Analytics: queries, streaming analytics, machine learning, and graph analytics.
24
1. Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez brands itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop
25
1. Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark
26
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark

Criteria            | MapReduce                                   | Tez                                 | Spark
License             | Open Source, Apache 2.0, version 2.x        | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization); batch | On-disk; batch, interactive          | In-memory and on-disk; batch, interactive, streaming (near real-time)
Language written in | Java                                        | Java                                 | Scala
API                 | [Java, Python, Scala]; user-facing          | Java; [ISV/engine/tool builder]      | [Scala, Java, Python]; user-facing
Libraries           | None; separate tools                        | None                                 | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria         | MapReduce                                                                    | Tez                                                      | Spark
Installation     | Bound to Hadoop                                                              | Bound to Hadoop                                          | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need for abstractions; interactive mode
Compatibility    | Same for data types and data sources                                         | Same for data types and data sources                     | Same for data types and data sources
YARN integration | YARN application                                                             | Ground-up YARN application                               | Spark is moving towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria    | MapReduce                  | Tez                        | Spark
Deployment  | YARN                       | YARN                       | [Standalone, YARN*, SIMR, Mesos, …]
Performance |                            |                            | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects | Still in its infancy
30
* Partial support
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
31
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions, and just call them in Spark from Java or Scala
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
32
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration, without development effort
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig-on-Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive-on-Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
  • Reboot with the new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
  • Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
42
3. Integration
(Diagram: Hadoop services and the open source tools that integrate with Spark: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL)
43
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has
full support for Hadoop InputFormats via
newAPIHadoopRDD. Example: HBaseTest.scala from the
Spark code https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available
for reading from and writing to HBase without using
the Hadoop API: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with
Spark. Status: still experimental, with no timetable for
possible support http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you
expose Cassandra tables as Spark RDDs, write Spark
RDDs to Cassandra tables, and execute arbitrary CQL
queries in your Spark applications. It also supports
integration of Spark Streaming with Cassandra https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration
is not based on Cassandra's Hadoop interface http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration
using different approaches http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume
data from Cassandra into Spark and store Resilient
Distributed Datasets (RDDs) from Spark to Cassandra http://tuplejump.github.io/calliope
• Cassandra as a storage backend with Spark is opening many new
avenues
• Kindling: An Introduction to Spark with Cassandra
(Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although
it can be used from Spark via the official Mongo-
Hadoop connector https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its
support for reading and writing JSON text files
48
3 Integration
• There is also NSMC: Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from
Apache Spark (still experimental)
• GitHub https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without
Hadoop http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph
database
• Getting Started with Apache Spark and Neo4j Using
Docker Compose, by Kenny Bastani, March 10, 2015
http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015
http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph
Analytics, by Kenny Bastani, November 3, 2014
http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit
reference to Mesos as "the" Resource Negotiator)
• Integration is still improving https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
3 Integration
• Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to
address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query
in-memory data in Spark; embed Drill execution in a
Spark data pipeline
Source: What's Coming in 2015 for
Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed
messaging system http://kafka.apache.org
• Spark Streaming integrates natively with Kafka:
Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming:
Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data
ingestion system designed for the Big Data
ecosystem http://flume.apache.org
• Spark Streaming integrates natively with
Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based
Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that
vastly simplifies the end-to-end experience of
working with JSON data
• Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD. No more DDL: just point Spark
SQL to JSON files and query. Starting with Spark 1.3,
SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
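The idea behind automatic schema inference can be sketched without Spark at all. Below is a pure-Python toy (not the Spark SQL implementation; `infer_schema` is a made-up helper) that scans flat JSON records and derives a field-to-type mapping, roughly the first step Spark SQL performs before it can answer SQL queries over JSON files (real Spark SQL also handles nested structures and type widening):

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: union of all fields seen, with the Python
    type name of the first value observed for each field."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema

lines = ['{"name": "alice", "age": 34}', '{"name": "bob", "city": "LA"}']
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

No DDL was written: the schema falls out of the data, which is exactly the convenience the slide describes.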
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL http://www.infoobjects.com/spark-sql-parquet
57
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+ https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to
work with datasets on Hadoop, hiding many of
the details of compression codecs, file formats,
partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16
release, so Spark jobs can read and write to Kite
datasets
• Kite Java Spark Demo https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics
engine http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1 http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark, in the form of an RDD that can
read data from Elasticsearch. Also, any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearch http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for
fast and easy indexing, ingestion, and serving of
searchable complex data: "CrunchIndexerTool on
Spark"
• Solr-on-Spark solution using Apache Solr, Spark,
Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive http://www.gethue.com
• A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
• Demo of Spark Igniter http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem
can work together, each for what it is especially good at,
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References
• Apache Mesos vs Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN
cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache
Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get
Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the
need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with
YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4 Complementarity +
• Data >> RAM: for processing huge data volumes,
much bigger than cluster RAM, Tez might be better,
since it is more "stream oriented", has a more mature
shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data
in memory, it can be much better when we process
data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native
YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer:
a smart execution engine dynamically selects the optimal
compute framework at each step in the big data
analytics process, based on the type of platform, the
attributes of the data, and the condition of the cluster
• Matt Schumpert on Datameer Smart Execution Engine http://www.infoq.com/articles/datameer-smart-execution-engine Interview on
November 13, 2014 with Matt Schumpert, Director of Product
Management at Datameer
• The Challenge to Choosing the "Right" Execution
Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by
Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles
Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption, February 12, 2015
http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms,
February 23, 2015
http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015
http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing.
Watch the Apache Flink project for true low-latency and iterative use cases
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark; more integration is on the way
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store) https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on
Hadoop. It gets its data from Amazon's S3
(most commonly), Redshift, or Elastic MapReduce https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and
data products in an instant, March 4, 2015
https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at
Spark Summit 2014, July 2, 2014
https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra,
presents itself as a Non-Hadoop Big Data Platform.
Data can be stored in the Cassandra File System http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with
Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014
http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector,
Helena Edelson, published on November 24, 2014
http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers: Infrastructure, Analytics,
and Applications
• xPatterns is cloud-based, exceedingly scalable,
and readily interfaces with existing IT systems
• 'xPatterns' Tag at
SparkBigData.com http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
• With EPIC software, you can spin up Hadoop
clusters - with the data and analytical tools that
your data scientists need - in minutes rather than
months https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes
streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially
compatible with open source Apache Spark http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory
speed across cluster frameworks, such as Spark
and MapReduce http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark
and MapReduce programs can run on top of it
without any code change
• Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-
grained sharing, which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution. This leads to considerable performance
improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing
apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including
Apache Spark, Apache Cassandra, Apache YARN,
Apache HDFS…
• 'Mesos' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos

Criteria          YARN                      Mesos
Resource sharing  Yes                       Yes
Written in        Java                      C++
Scheduling        Memory only               CPU and Memory
Running tasks     Unix processes            Linux Container groups
Requests          Specific requests and     More generic, but more coding
                  locality preference       for writing frameworks
Maturity          Less mature               Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8, whose much more concise
lambda expressions get code nearly as
simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014
http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at
SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the
ground up for Spark https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains
compatibility with Hive. It supports all existing Hive data
formats, user-defined functions (UDFs), and the Hive
metastore
• Spark SQL also allows manipulating (semi-)structured
data, as well as ingesting data from sources that
provide schema, such as JSON, Parquet, Hive, or
EDWs. It unifies SQL and sophisticated analysis,
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
'Spark MLlib' Tag at
SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming

Criteria                  Storm                      Spark Streaming
Processing model          Record at a time           Mini batches
Latency                   Sub-second                 Few seconds
Fault tolerance (every    At least once (may be      Exactly once
record processed)         duplicates)
Batch framework           Not available              Core Spark API
integration
Supported languages       Any programming language   Scala, Java, Python
95
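The processing-model row can be illustrated with a small pure-Python toy (this is not Storm or Spark code; both functions are made up for illustration) contrasting record-at-a-time handling with grouping records into mini batches:

```python
# Toy illustration of the two streaming models (not the Storm/Spark APIs).

def record_at_a_time(stream, handle):
    """Storm-style: each record is handled as soon as it arrives."""
    for record in stream:
        handle(record)

def mini_batches(stream, handle_batch, batch_size=3):
    """Spark Streaming-style: records are grouped into small batches,
    and the handler sees one whole batch at a time."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []
    if batch:                      # flush the final partial batch
        handle_batch(batch)

events = list(range(7))
per_record, per_batch = [], []
record_at_a_time(events, per_record.append)
mini_batches(events, per_batch.append)
print(per_record)   # [0, 1, 2, 3, 4, 5, 6]
print(per_batch)    # [[0, 1, 2], [3, 4, 5], [6]]
```

The batching step is where Spark Streaming trades a few seconds of latency for the ability to reuse the core batch engine on each mini batch.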
GraphX
96
'GraphX' Tag at
SparkBigData.com http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based
notebook that enables interactive data analytics.
It has built-in Apache Spark support
• Spark Notebook is an interactive web-based
editor that can combine Scala code, SQL
queries, Markup, or even JavaScript in a
collaborative manner https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for
IPython https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4. Alternatives: do your due diligence, based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
1 Evolution of Programming APIs
• MapReduce in Java is like the assembly code of Big
Data http://wiki.apache.org/hadoop/WordCount
• Pig http://pig.apache.org
• Hive http://hive.apache.org
• Scoobi: a Scala productivity framework for Hadoop https://github.com/NICTA/scoobi
• Cascading http://www.cascading.org
• Scalding: a Scala API for Cascading http://twitter.com/scalding
• Crunch http://crunch.apache.org
• Scrunch http://crunch.apache.org/scrunch.html
22
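The "assembly code" point can be made concrete without a cluster. Below is a pure-Python sketch (not Hadoop or Spark code) contrasting an explicit map → shuffle → reduce word count, the shape MapReduce forces on every problem, with the one-liner that higher-level APIs such as Pig, Hive, or Spark collapse it into:

```python
from collections import Counter
from itertools import groupby

def wordcount_mapreduce_style(lines):
    """Explicit map -> shuffle (sort/group by key) -> reduce."""
    mapped = [(word, 1) for line in lines for word in line.split()]   # map
    mapped.sort(key=lambda kv: kv[0])                                 # shuffle
    return {key: sum(v for _, v in group)                             # reduce
            for key, group in groupby(mapped, key=lambda kv: kv[0])}

def wordcount_high_level(lines):
    """What a higher-level API collapses the same job into."""
    return dict(Counter(word for line in lines for word in line.split()))

lines = ["to be or not", "to be"]
assert wordcount_mapreduce_style(lines) == wordcount_high_level(lines)
print(wordcount_high_level(lines))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Both versions compute the same counts; the difference in ceremony is exactly the gap the tools on this slide were created to close.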
1 Evolution of Compute Models
When the Apache Hadoop project started in 2007,
MapReduce v1 was the only choice of compute model
(execution engine) on Hadoop. Now we have, in addition
to MapReduce v2: Tez, Spark, and Flink
23
1st Generation: Batch
2nd Generation: Batch, Interactive
3rd Generation: Batch, Interactive, Near-Real-Time
4th Generation: Batch, Interactive, Real-Time, Iterative
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets" http://hadoop.apache.org
• Batch, Scalability, Abstractions (see slide on evolution of Programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job
• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning, and Graph Analytics
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The
Apache Tez project is aimed at building an
application framework which allows for a complex
directed-acyclic-graph of tasks for processing
data. It is currently built atop YARN"
Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for
building high-performance batch and
interactive data processing applications, coordinated by YARN in Apache Hadoop
25
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing" https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark
26
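A rough intuition for RDDs can be sketched in plain Python (this is not the Spark API; generators stand in for lazy transformations, and a materialized list stands in for a cached dataset):

```python
# Toy model of RDD behavior: transformations are lazy, actions evaluate,
# caching materializes a dataset for reuse.

data = range(1, 6)                       # a "dataset" of 1..5

# Lazy "transformations": nothing is computed when these lines run,
# just like rdd.map(...).filter(...) only builds a lineage.
squared = (x * x for x in data)
evens = (x for x in squared if x % 2 == 0)

# An "action" forces evaluation of the whole pipeline.
result = list(evens)
print(result)            # [4, 16]

# "Caching": materialize once, then reuse without recomputation,
# loosely analogous to rdd.cache() keeping a dataset in cluster memory.
cached = [x * x for x in range(1, 6)]
print(sum(cached))       # 55
print(max(cached))       # 25
```

The reuse of `cached` by several "actions" without recomputing is the pattern that makes Spark fast for iterative workloads such as machine learning.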
1 Evolution Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink http://flink.apache.org offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs Tez vs Spark

Criteria             MapReduce                 Tez                       Spark
License              Open Source Apache 2.0,   Open Source Apache 2.0,   Open Source Apache 2.0,
                     version 2.x               version 0.x               version 1.x
Processing model     On-disk (disk-based       On-disk: Batch,           In-memory and on-disk:
                     parallelization): Batch   Interactive               Batch, Interactive,
                                                                         Streaming (near real-time)
Language written in  Java                      Java                      Scala
API                  [Java, Python, Scala],    Java                      [Scala, Java, Python],
                     user-facing               [ISV/Engine/Tool          user-facing
                                               builder]
Libraries            None, separate tools      None                      [Spark Core, Spark Streaming,
                                                                         Spark SQL, MLlib, GraphX]
Hadoop MapReduce vs Tez vs Spark

Criteria          MapReduce                     Tez                        Spark
Installation      Bound to Hadoop               Bound to Hadoop            Isn't bound to Hadoop
Ease of use       Difficult to program, needs   Difficult to program; no   Easy to program, no need
                  abstractions; no interactive  interactive mode (except   of abstractions;
                  mode (except Hive, Pig)       Hive, Pig)                 interactive mode
Compatibility     Same for data types and       Same for data types and    Same for data types and
                  data sources                  data sources               data sources
YARN integration  YARN application              Ground-up YARN             Spark is moving towards
                                                application                YARN
Hadoop MapReduce vs Tez vs Spark

Criteria      MapReduce             Tez                   Spark
Deployment    YARN                  YARN                  [Standalone, YARN, SIMR, Mesos, …]
Performance                                               - Good performance when data fits
                                                            into memory
                                                          - Performance degradation otherwise
Security      More features and     More features and     Still in its infancy
              projects              projects              (partial support)
30
IV Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can
migrate to Spark and leverage Spark Core as the
execution engine:
1. You can often reuse your mapper and
reducer functions and just call them in
Spark from Java or Scala
2. You can translate your code from
MapReduce to Apache Spark. How-to:
Translate from MapReduce to Apache Spark http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
32
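The function-reuse point can be sketched in pure Python (not the Hadoop or Spark APIs; both "engines" below are made-up toys): the same `mapper` and `reducer`, written once, are driven by an MR-style engine and a Spark-style flatMap/groupByKey pipeline.

```python
# Toy sketch: one mapper/reducer pair, two different "engines".

def mapper(line):                 # emits (word, 1) pairs
    return [(word, 1) for word in line.split()]

def reducer(key, values):         # sums the counts for one key
    return key, sum(values)

def run_mapreduce_style(lines):
    """Engine 1: explicit shuffle by key, as an MR framework would do."""
    shuffled = {}
    for line in lines:
        for key, value in mapper(line):
            shuffled.setdefault(key, []).append(value)
    return dict(reducer(k, vs) for k, vs in shuffled.items())

def run_spark_style(lines):
    """Engine 2: flatMap then group-by-key pipeline, reusing the
    exact same mapper and reducer functions unchanged."""
    pairs = [kv for line in lines for kv in mapper(line)]     # flatMap
    grouped = {}
    for key, value in pairs:                                  # groupByKey
        grouped.setdefault(key, []).append(value)
    return dict(reducer(k, vs) for k, vs in grouped.items())  # reduce

lines = ["spark and hadoop", "spark or hadoop"]
assert run_mapreduce_style(lines) == run_spark_style(lines)
print(run_spark_style(lines))
```

Because the user-written functions carry the business logic, swapping the surrounding engine is largely mechanical, which is what makes the migration path in point 1 above practical.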
2 Transition
3. The following tools, originally based on Hadoop
MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration
without development effort
• Speed up your existing Pig scripts on Spark (Query,
Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as
Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test
cases on Pig, still open) https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality
through the community
• 'Pig on Spark' Tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta,
expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on
MapReduce or Tez easily migrate to Spark, without
development effort
• Exposes Spark users to a viable, feature-rich, de facto
standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries
involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open), Q1 2015 https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta; expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from an RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which the Sqoop jobs run: https://issues.apache.org/jira/browse/SQOOP-1532
37
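What Sqoop does at its core — pull rows out of an RDBMS and emit them as text records for the Hadoop side — can be sketched in a few lines. This is a toy illustration using SQLite, not Sqoop's actual implementation; the table and column names are made up:

```python
import sqlite3

def export_as_csv_lines(conn, table):
    # Read every row of the table and render it as a comma-separated record,
    # roughly what a Sqoop import writes into HDFS as text files.
    cursor = conn.execute("SELECT * FROM %s" % table)
    return [",".join(str(value) for value in row) for row in cursor]

# Example: a throwaway in-memory database standing in for the source RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "bob")])
```

Sqoop adds what this sketch omits: parallel splits over a key range, type mapping, and writing to HDFS — which is exactly the part Sqoop2 wants to make pluggable across execution engines.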
Cascading (expected in the 3.1 release)
• Cascading http://www.cascading.org is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye, MapReduce. Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence-based recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3. Integration (diagram: service layers and the open-source tools at each layer)
Storage/Serving Layer | Data Formats | Data Ingestion Services | Resource Management | Search | SQL
43
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory) http://hortonworks.com/blog/ddm/ to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: driving business insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting started with Apache Spark and Neo4j using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for big data graph analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as "the" resource negotiator).
• Integration is still improving, and some open issues are critical ones (JIRA: project = SPARK AND summary ~ yarn AND status = OPEN): https://issues.apache.org/jira/
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input for Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: integrating Kafka and Spark Streaming — code examples and state of the game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3. Integration
• Apache Flume is a streaming event-data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach.
• Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
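The schema-inference idea can be illustrated outside Spark: walk a set of JSON records and collect the union of fields with their observed types. This is a toy sketch, not Spark SQL's actual algorithm, which also merges conflicting types and handles nested structures:

```python
import json

def infer_schema(json_lines):
    # Union of all fields seen across records, with the first observed
    # Python type name standing in for a real SQL type.
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema
```

Because the schema is derived from the data itself, no DDL is needed before querying — which is exactly the convenience the slide describes.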
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
3. Integration
• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
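The "dynamically split the data" step of that use case amounts to bucketing heterogeneous inbound records by a discriminator field before writing each bucket out (to Avro, in the slide's setup). A minimal sketch, with a hypothetical `type` field:

```python
def split_by_type(records):
    # Bucket records by their "type" field; unseen types just create
    # new buckets, so layout changes require no code change.
    buckets = {}
    for record in records:
        buckets.setdefault(record.get("type", "unknown"), []).append(record)
    return buckets
```

In Spark proper this would be a keyed grouping over an RDD; the dictionary here plays the role of the resulting per-type partitions.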
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrates ingestion of HDFS data into Solr from MapReduce to Spark.
• Updates and deletes existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open-source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services."
• Project Myriad is an open-source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity: YARN + Mesos
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad project marries YARN and Apache Mesos resource management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. Mesos: can't we all just get along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for data pipelines with native YARN integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
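The rule of thumb above can be stated as a tiny decision helper. This is purely illustrative; the 0.5 headroom factor is my assumption, not a Spark or Tez recommendation:

```python
def prefer_spark_caching(data_gb, cluster_ram_gb, headroom=0.5):
    # Cache in memory only if the working set fits comfortably in
    # cluster RAM; otherwise a disk/stream-oriented engine may win.
    return data_gb <= cluster_ram_gb * headroom
```

In practice the decision also depends on job shape (iterative vs one-pass) and on what else shares the cluster, which is exactly why the two engines are complementary rather than interchangeable.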
4. Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The challenge to choosing the "right" execution engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a multi-execution-engine Hadoop environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort big data software removes major barriers to mainstream Apache Hadoop adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort automates data migrations across multiple platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop-ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS (Hadoop Distributed File System). Your Big Data use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution.
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: ultra-fast data analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra, with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open-source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives
Hadoop ecosystem component → Spark ecosystem alternative:
HDFS → Tachyon
YARN → Mesos
Pig → Spark native API
Hive → Spark SQL
Mahout → MLlib
Storm → Spark Streaming
Giraph → GraphX
HUE → Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system, enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster-computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos
Criteria: YARN | Mesos
Resource sharing: Yes | Yes
Written in: Java | C++
Scheduling: Memory only | CPU and memory
Running tasks: Unix processes | Linux container groups
Requests: Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity: Less mature | Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as with the Scala API.
• ETL with Spark: First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming
Criteria: Storm | Spark Streaming
Processing model: Record at a time | Mini batches
Latency: Sub-second | Few seconds
Fault tolerance (every record processed): At least once (may be duplicates) | Exactly once
Batch framework integration: Not available | Core Spark API
Supported languages: Any programming language | Scala, Java, Python
95
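The "record at a time" vs "mini batches" distinction in the table above can be sketched simply: Spark Streaming groups an incoming event stream into fixed-width batches by arrival time, then runs a batch job on each. This is toy code illustrating the idea, not the DStream API:

```python
def micro_batches(events, batch_seconds):
    # events: (timestamp_seconds, value) pairs; group them into
    # fixed-width windows, the way a DStream discretizes a stream.
    batches = {}
    for timestamp, value in events:
        batches.setdefault(int(timestamp // batch_seconds), []).append(value)
    return [batches[key] for key in sorted(batches)]
```

Batching is what gives Spark Streaming its exactly-once semantics and Core Spark API reuse, at the cost of the few seconds of latency shown in the table.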
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
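GraphX's canonical example is PageRank; the fixed-point iteration it distributes can be shown in plain Python. This is a minimal sketch that assumes every node has out-links (it ignores dangling nodes), not GraphX's Pregel-based implementation:

```python
def pagerank(links, iterations=20, damping=0.85):
    # links: node -> list of outgoing neighbors.
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for source, targets in links.items():
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share  # each neighbor gets an equal share
        rank = new_rank
    return rank
```

GraphX distributes exactly this kind of iterative message-passing over a partitioned graph, which is why iterative workloads are a poor fit for one-shot MapReduce but a natural fit for Spark.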
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your own deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
1. Evolution of Compute Models
When the Apache Hadoop project started in 2007, MapReduce v1 was the only choice as a compute model (execution engine) on Hadoop. Now we have, in addition to MapReduce v2: Tez, Spark and Flink.
23
• 1st generation (MapReduce): batch
• 2nd generation (Tez): batch, interactive
• 3rd generation (Spark): batch, interactive, near-real-time
• 4th generation (Flink): batch, interactive, real-time, iterative
1. Evolution: Hadoop MapReduce
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch. Scalability. Abstractions (see slide on the evolution of programming APIs). User-Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• There is a need to integrate many disparate tools for advanced Big Data analytics: queries, streaming analytics, machine learning and graph analytics.
24
1. Evolution: Apache Tez
• Tez: Hindi for "speed".
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1 Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark brands itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real-time.
• The rapid in-memory processing of Resilient Distributed Datasets (RDDs) is the "core capability" of Apache Spark.
26
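The RDD "core capability" above, lazy transformations plus enough lineage to recompute lost partitions, can be illustrated with a toy Python class (a conceptual sketch only, not Spark's actual API or implementation):

```python
class ToyRDD:
    """Toy illustration of an RDD: records its lineage (a chain of
    transformations) and only computes when an action is called."""
    def __init__(self, data, lineage=None):
        self._data = data                 # source partition
        self._lineage = lineage or []     # pending transformations

    def map(self, f):
        # Transformations are lazy: just extend the lineage.
        return ToyRDD(self._data, self._lineage + [("map", f)])

    def filter(self, p):
        return ToyRDD(self._data, self._lineage + [("filter", p)])

    def collect(self):
        # An action replays the lineage; after a node failure, the same
        # replay would rebuild the lost partition from the source.
        out = self._data
        for op, f in self._lineage:
            out = list(map(f, out)) if op == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.collect())  # [20, 30, 40]
```

In Spark itself, the lineage additionally tracks partitioning and dependencies between RDDs, which is what makes fault recovery and in-memory caching cheap.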
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink brands itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
  • Batch and streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs Tez vs Spark
• License: all three are open source under Apache 2.0; MapReduce at version 2.x, Tez at version 0.x, Spark at version 1.x.
• Processing model: MapReduce is on-disk (disk-based parallelization), batch only; Tez is on-disk, batch and interactive; Spark is in-memory and on-disk, with batch, interactive and streaming (near real-time).
• Language written in: MapReduce and Tez are written in Java; Spark is written in Scala.
• API: MapReduce offers user-facing APIs in Java, Python and Scala; Tez offers a Java API aimed at ISV/engine/tool builders; Spark offers user-facing APIs in Scala, Java and Python.
• Libraries: MapReduce has none (separate tools); Tez has none; Spark bundles Spark Core, Spark Streaming, Spark SQL, MLlib and GraphX.
28
Hadoop MapReduce vs Tez vs Spark
• Installation: MapReduce and Tez are bound to Hadoop; Spark isn't bound to Hadoop.
• Ease of use: MapReduce is difficult to program and needs abstractions, with no interactive mode except via Hive/Pig; Tez is likewise difficult to program, with no interactive mode except via Hive/Pig; Spark is easy to program with no need for abstractions, and has an interactive mode.
• Compatibility: compatibility to data types and data sources is the same for all three.
• YARN integration: MapReduce is a YARN application; Tez is a ground-up YARN application; Spark is moving towards YARN.
29
Hadoop MapReduce vs Tez vs Spark
• Deployment: MapReduce and Tez deploy on YARN; Spark deploys standalone or on YARN, SIMR, Mesos, …
• Performance: Spark shows good performance when data fits into memory, and performance degradation otherwise.
• Security: MapReduce and Tez have more security features and projects; Spark's security is still in its infancy (partial support).
30
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. "How-to: Translate from MapReduce to Apache Spark": http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
32
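To sketch what such a translation buys you, here is a word count written in the classic MapReduce style, in plain Python for illustration; in Spark itself, roughly the same chain collapses to `lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(operator.add)`:

```python
from itertools import groupby

# MapReduce style: explicit map, shuffle (group by key) and reduce phases.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return word, sum(counts)

lines = ["spark and hadoop", "spark or hadoop"]
mapped = [kv for line in lines for kv in mapper(line)]                # map phase
grouped = groupby(sorted(mapped), key=lambda kv: kv[0])               # shuffle phase
counts = dict(reducer(w, [c for _, c in kvs]) for w, kvs in grouped)  # reduce phase

print(counts)  # {'and': 1, 'hadoop': 2, 'or': 1, 'spark': 2}
```

The mapper and reducer functions survive the move unchanged; only the job plumbing around them disappears.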
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• "Hive on Spark", February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• "Hive on Spark is blazing fast, or is it?", Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014: "Goodbye MapReduce." Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• "Mahout Scala and Spark bindings", Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• "Co-occurrence Based Recommendations with Mahout, Scala and Spark", published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
[Diagram: open source tools in the Hadoop ecosystem by service: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• "Using MongoDB with Hadoop & Spark":
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• "Categorical PageRank Using Neo4j and Apache Spark", by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; see the open Spark YARN issues in JIRA (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC): https://issues.apache.org/jira/issues/
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark. Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs. Use BI tools to query in-memory data in Spark. Embed Drill execution in a Spark data pipeline.
Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. "Spark Streaming + Kafka Integration Guide": http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event-data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach.
• Approach 2 (experimental): pull-based approach using a custom sink.
• "Spark Streaming + Flume Integration Guide": https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
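The schema-inference idea can be sketched in a few lines of plain Python (illustrative only; Spark SQL's real inference also handles nested structures and reconciles conflicting types across records):

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: take the union of fields across all records,
    each mapped to its observed type, mimicking the idea behind Spark SQL's
    JSON schema inference."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            # Keep the first observed type for each field.
            schema.setdefault(field, type(value).__name__)
    return schema

dataset = [
    '{"name": "spark", "stars": 4000}',
    '{"name": "hadoop", "stars": 7000, "first_release": 2006}',
]
print(infer_schema(dataset))
# {'name': 'str', 'stars': 'int', 'first_release': 'int'}
```

Note that fields present in only some records still appear in the unified schema, which is exactly why no up-front DDL is needed.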
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
• Update and delete existing documents in Solr at scale.
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark", October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big Data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity
References:
• "Apache Mesos vs Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the Big Data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. An interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution:
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete Big Data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives: Hadoop Ecosystem vs Spark Ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps. Provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos
• Resource sharing: yes for both.
• Written in: YARN in Java; Mesos in C++.
• Scheduling: YARN schedules memory only; Mesos schedules CPU and memory.
• Running tasks: YARN runs Unix processes; Mesos runs Linux container groups.
• Requests: YARN supports specific requests and locality preference; Mesos is more generic, but requires more coding for writing frameworks.
• Maturity: YARN is less mature; Mesos is relatively more mature.
90
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as the Scala API.
• "ETL with Spark", First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
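The chained, functional style shared by these language bindings can be sketched in plain Python. `ToyRDD` below is a hypothetical, single-machine stand-in (not the real pyspark API) showing the shape of a classic Spark word count:

```python
class ToyRDD:
    """Toy, single-machine stand-in for a Spark RDD (illustration only)."""
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        # one input item can produce many output items
        return ToyRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return ToyRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        # merge the values of each key with the given reduce function
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return ToyRDD(acc.items())

    def collect(self):
        return self.data

# Word count in the chained, functional style of the Spark API
lines = ToyRDD(["to be or", "not to be"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
```

The same pipeline reads almost identically in Scala, Java 8, or Python against the real API; only the cluster execution behind each call differs.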
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming

Criteria                 Storm                     Spark Streaming
Processing model         Record at a time          Mini batches
Latency                  Sub-second                Few seconds
Fault tolerance -        At least once (may be     Exactly once
every record processed   duplicates)
Batch framework          Not available             Core Spark API
integration
Supported languages      Any programming           Scala, Java, Python
                         language
95
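The processing-model row is the key difference, and it can be sketched in plain Python (hypothetical helper names; no Storm or Spark required). The same handler produces the same results either way; what differs is when the work happens:

```python
def process_record_at_a_time(stream, handler):
    """Storm-style: handle each record as soon as it arrives (lowest latency)."""
    return [handler(record) for record in stream]

def process_mini_batches(stream, handler, batch_size):
    """Spark Streaming-style: group records into small batches (DStreams),
    then run the same batch machinery over each batch."""
    out, batch = [], []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            out.extend(handler(r) for r in batch)
            batch = []
    if batch:  # flush the final partial batch
        out.extend(handler(r) for r in batch)
    return out

events = [1, 2, 3, 4, 5]
double = lambda x: 2 * x
# Same results either way; the difference is when the work happens
# (per record, sub-second, vs. per batch, a few seconds of latency)
assert process_record_at_a_time(events, double) == process_mini_batches(events, double, 2)
```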
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
1 Evolution
• This is how Hadoop MapReduce is branding itself: "A YARN-based system for parallel processing of large data sets." http://hadoop.apache.org
• Batch, Scalability, Abstractions (see slide on evolution of Programming APIs), User Defined Functions (UDFs)…
• Hadoop MapReduce (MR) works pretty well if you can express your problem as a single MR job. In practice, most problems don't fit neatly into a single MR job.
• Need to integrate many disparate tools for advanced Big Data Analytics: Queries, Streaming Analytics, Machine Learning, and Graph Analytics.
24
1 Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN."
Source: http://tez.apache.org
• Apache™ Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1 Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework: its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
1 Evolution Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
• Batch and Streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs. Tez vs. Spark

Criteria           Hadoop MapReduce         Tez                     Spark
License            Open Source Apache       Open Source Apache      Open Source Apache
                   2.0, version 2.x         2.0, version 0.x        2.0, version 1.x
Processing model   On-disk (disk-based      On-disk, Batch,         In-memory, On-disk,
                   parallelization),        Interactive             Batch, Interactive,
                   Batch                                            Streaming (Near
                                                                    Real-Time)
Written in         Java                     Java                    Scala
API                [Java, Python, Scala]    Java [ISV/Engine/       [Scala, Java, Python]
                   User-facing              Tool builder]           User-facing
Libraries          None, separate tools     None                    [Spark Core, Spark
                                                                    Streaming, Spark SQL,
                                                                    MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark

Criteria           Hadoop MapReduce         Tez                     Spark
Installation       Bound to Hadoop          Bound to Hadoop         Isn't bound to Hadoop
Ease of use        Difficult to program,    Difficult to program,   Easy to program, no
                   needs abstractions;      no interactive mode     need of abstractions;
                   no interactive mode      except Hive, Pig        interactive mode
                   except Hive, Pig
Compatibility      Same for data types      Same for data types     Same for data types
                   and data sources         and data sources        and data sources
YARN integration   YARN application         Ground-up YARN          Spark is moving
                                            application             towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark

Criteria           Hadoop MapReduce         Tez                     Spark
Deployment         YARN                     YARN                    [Standalone, YARN,
                                                                    SIMR, Mesos, …]
Performance        - Good performance when data fits into memory
                   - Performance degradation otherwise
Security           More features and        More features and       Still in its infancy
                   projects                 projects
30
Partial support
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
32
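Path 1 above — reusing existing mapper and reducer functions under a new driver — can be sketched in plain Python (hypothetical names; no Hadoop or Spark required). The MapReduce-style functions stay unchanged; only the driver around them is new:

```python
from itertools import groupby

# Existing MapReduce-style logic, unchanged
def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(a, b):
    return a + b

def run_job(lines):
    """Minimal local driver calling the same functions the way a Spark
    job would: flatMap with the mapper, reduceByKey with the reducer."""
    pairs = [kv for line in lines for kv in mapper(line)]
    pairs.sort(key=lambda kv: kv[0])  # stand-in for the shuffle/sort phase
    result = {}
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [v for _, v in group]
        total = values[0]
        for v in values[1:]:
            total = reducer(total, v)
        result[key] = total
    return result
```

In real Spark code, the same mapper and reducer would be passed to `flatMap` and `reduceByKey` instead of a hand-written driver loop.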
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: Passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (Currently in Beta; Expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
[Slide shows a grid of services and open source tool logos: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1 (Introduction & Setup): https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2 (Hive Example): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3 (Spark Example & Key Takeaways): http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving:
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND
20summary20~20yarn20AND20status203D20OPEN20ORDER20
BY20priority20DESC0A
• Some open issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883). https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
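The mix-and-match idea — a declarative SQL step followed by imperative post-processing — can be sketched with stdlib sqlite3 standing in for Spark SQL over Hive tables (illustrative data and names, not the Spark API):

```python
import sqlite3

# Hypothetical sample data, standing in for a Hive table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user TEXT, n INTEGER)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)",
                 [("ann", 3), ("bob", 5), ("ann", 2)])

# Declarative step: plain SQL over the imported data
rows = conn.execute(
    "SELECT user, SUM(n) FROM clicks GROUP BY user ORDER BY user").fetchall()

# Imperative step: post-process the SQL result in ordinary code,
# the mix-and-match pattern Spark SQL enables over Hive tables and RDDs
top_users = [user for user, total in rows if total >= 5]
```

In Spark SQL the aggregation would run distributed over a Hive table, and the post-processing would be ordinary RDD transformations; the programming pattern is the same.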
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
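What "automatically infer the schema" means can be sketched in plain Python (a toy stand-in for Spark SQL's JSON schema inference, with hypothetical names):

```python
import json

def infer_schema(json_lines):
    """Toy sketch of schema inference: scan JSON records and derive a
    field -> type-name mapping, unioning fields across records."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

records = ['{"name": "ann", "age": 34}', '{"name": "bob", "city": "LA"}']
schema = infer_schema(records)  # fields are unioned across both records
```

Spark SQL does a distributed version of this scan (including nested structures and type widening), then exposes the result as a queryable SchemaRDD/DataFrame.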
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
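The columnar idea can be sketched in plain Python (a toy stand-in, not the Parquet format itself): storing values column-by-column lets a query read only the columns it touches:

```python
def to_columnar(rows, fields):
    """Toy row-to-columnar conversion: a columnar layout keeps all values
    of one column together, so a query touching few columns reads less."""
    return {f: [row[f] for row in rows] for f in fields}

rows = [{"user": "ann", "n": 3}, {"user": "bob", "n": 5}]
cols = to_columnar(rows, ["user", "n"])

# A scan of one column never touches the other columns' values
assert sum(cols["n"]) == 8
```

Parquet adds per-column encoding and compression on top of this layout, which is why analytical queries over wide tables read far less data than with row-oriented formats.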
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4 Complementarity +
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
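The "Data << RAM" point — cache the parsed data once and reuse it across jobs — can be sketched in plain Python with a call counter (hypothetical names; not Spark's actual cache() API):

```python
def expensive_parse(raw):
    """Stand-in for a costly parse/deserialization step."""
    expensive_parse.calls += 1
    return int(raw)
expensive_parse.calls = 0

def run_queries(raw_data, queries, cache_in_memory):
    cached = None
    results = []
    for query in queries:
        if cache_in_memory:
            if cached is None:
                # parse once, keep the parsed data around (Spark-style caching)
                cached = [expensive_parse(r) for r in raw_data]
            data = cached
        else:
            # re-parse the raw data for every query (disk-oriented engines)
            data = [expensive_parse(r) for r in raw_data]
        results.append(query(data))
    return results

raw = ["1", "2", "3"]
# With caching, two queries over three records cost only three parses
assert run_queries(raw, [sum, max], cache_in_memory=True) == [6, 3]
```

Without caching, the same two queries would parse all three records twice; the gap widens with more iterations, which is why iterative workloads that fit in memory favor Spark.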
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
86
4. Alternatives

Hadoop Ecosystem | Spark Ecosystem
Component:
  HDFS | Tachyon
  YARN | Mesos
Tools:
  Pig | Spark native API
  Hive | Spark SQL
  Mahout | MLlib
  Storm | Spark Streaming
  Giraph | GraphX
  HUE | Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: Datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos

Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
90
Spark Native API
• Spark Native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
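To give a feel for how concise the native API is, the classic word count in the Scala shell (the input path is a placeholder; `sc` is the shell's SparkContext):

```scala
// Each transformation below would be a separate mapper/reducer phase in MapReduce
val counts = sc.textFile("hdfs:///input/docs")
  .flatMap(line => line.split("\\s+")) // split lines into words
  .map(word => (word, 1))              // pair each word with a count of 1
  .reduceByKey(_ + _)                  // sum counts per word

counts.take(10).foreach(println)
```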
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming

Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
95
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment!
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
1. Evolution
• Tez: Hindi for "speed"
• This is how Apache Tez is branding itself: "The Apache Tez project is aimed at building an application framework which allows for a complex directed-acyclic-graph of tasks for processing data. It is currently built atop YARN." Source: http://tez.apache.org
• Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop.
25
1. Evolution
• 'Spark' for lightning-fast speed
• This is how Apache Spark is branding itself: "Apache Spark is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (http://flink.apache.org) offers:
  • Batch and Streaming in the same system
  • Beyond DAGs (cyclic operator graphs)
  • Powerful, expressive APIs
  • Inside-the-system iterations
  • Full Hadoop compatibility
  • Automatic, language-independent optimizer
• 'Flink' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
License | Open Source, Apache 2.0, version 2.x | Open Source, Apache 2.0, version 0.x | Open Source, Apache 2.0, version 1.x
Processing model | On-disk (disk-based parallelization), batch | On-disk, batch, interactive | In-memory, on-disk, batch, interactive, streaming (near real-time)
Language written in | Java | Java | Scala
API | [Java, Python, Scala], user-facing | [Java], ISV/Engine/Tool builder | [Scala, Java, Python], user-facing
Libraries | None, separate tools | None | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
Installation | Bound to Hadoop | Bound to Hadoop | Isn't bound to Hadoop
Ease of use | Difficult to program, needs abstractions; no interactive mode except Hive/Pig | Difficult to program; no interactive mode except Hive/Pig | Easy to program, no need of abstractions; interactive mode
Compatibility | to data types and data sources is the same | to data types and data sources is the same | to data types and data sources is the same
YARN integration | YARN application | Ground-up YARN application | Spark is moving towards YARN
29
Hadoop MapReduce vs Tez vs Spark

Criteria | Hadoop MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN*, SIMR, Mesos, ...]
Performance | - | - | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy
* Partial support
30
III. Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
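The translation path above usually collapses a full mapper/reducer pair into a couple of transformations; a sketch of the canonical word-count migration (paths are placeholders, `sc` is an existing SparkContext):

```scala
// MapReduce version: the Mapper tokenizes and emits (word, 1),
// the Reducer sums values per key. The same logic in Spark:
val result = sc.textFile("hdfs:///input")
  .flatMap(_.split(" "))   // the mapper's tokenization
  .map(word => (word, 1))  // the mapper's emit
  .reduceByKey(_ + _)      // the reducer's sum

result.saveAsTextFile("hdfs:///output")
```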
32
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, ...
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez: hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
Mahout (Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3. Integration

Service | Open Source Tool
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
44
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
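The newAPIHadoopRDD route mentioned above looks roughly like this (a sketch modeled on HBaseTest.scala; the table name is hypothetical and an existing SparkContext `sc` is assumed):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table name

// Expose the HBase table as an RDD of (row key, row result) pairs
val hBaseRDD = sc.newAPIHadoopRDD(conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(hBaseRDD.count())
```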
45
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark/
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
48
3. Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the Resource Negotiator)
• Integration still improving: https://issues.apache.org/jira/issues/ (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
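The three Hive bullets above map onto a HiveContext session like this (a sketch adapted from the Spark SQL programming guide; the table and input path are illustrative):

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Create and load a Hive table, then query it from Spark
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
hiveContext.sql("SELECT key, value FROM src WHERE key < 10")
  .collect()
  .foreach(println)
```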
52
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
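A minimal receiver-based sketch of the native integration (Spark 1.2-era API; the ZooKeeper quorum, consumer group and topic are placeholders):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(2))

// Map of (topic -> number of receiver threads)
val messages = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))

// Streaming word count over each 2-second mini batch
messages.map(_._2)
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _)
  .print()

ssc.start()
ssc.awaitTermination()
```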
54
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
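The schema-inference workflow amounts to a few lines (Spark 1.2-era SchemaRDD API; the file and fields are illustrative, and `sqlContext` is an existing SQLContext):

```scala
// Point Spark SQL at JSON files; the schema is inferred automatically
val people = sqlContext.jsonFile("people.json")
people.printSchema()

// Register and query with plain SQL, with no DDL needed
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 13")
  .collect()
  .foreach(println)
```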
56
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
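Those three bullets amount to very little code (Spark 1.2-era API; the file names are hypothetical):

```scala
// Import relational data from a Parquet file (the schema is preserved)
val events = sqlContext.parquetFile("events.parquet")
events.registerTempTable("events")

// Run SQL over the imported data, then write results back out as Parquet
val recent = sqlContext.sql("SELECT * FROM events WHERE year = 2015")
recent.saveAsParquetFile("recent-events.parquet")
```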
57
3. Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
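elasticsearch-hadoop's native Spark API in a nutshell (a sketch assuming es.nodes points at a running Elasticsearch cluster; the index/type name is hypothetical):

```scala
import org.elasticsearch.spark._

// Index any RDD whose content can be translated into documents
val doc = Map("title" -> "Spark and ES", "views" -> 10)
sc.makeRDD(Seq(doc)).saveToEs("blog/posts")

// Read the documents back as an RDD
val esRdd = sc.esRDD("blog/posts")
```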
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity: Mesos + YARN
References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
73
1. File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
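For example, option 4 above needs no Hadoop cluster at all; a sketch of reading straight from S3 (credentials and bucket name are placeholders):

```scala
// Hadoop's S3 filesystem client is used, but no HDFS is involved
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

val logs = sc.textFile("s3n://my-bucket/logs/")
println(logs.count())
```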
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• ...
75
IV Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
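The deployment modes above differ mainly in the master URL handed to spark-submit. As a minimal sketch (plain Python, no Spark needed), the snippet below builds the command line for each mode; the host names are hypothetical placeholders, and the master-URL schemes follow Spark 1.x conventions.

```python
# Sketch: the master URL passed to spark-submit selects the deployment
# mode. Host names below are hypothetical placeholders.
MASTER_URLS = {
    "local": "local[*]",                  # all cores on one machine
    "standalone": "spark://master:7077",  # Spark's built-in cluster manager
    "mesos": "mesos://master:5050",       # Apache Mesos
    "yarn": "yarn-cluster",               # Hadoop YARN (Spark 1.x syntax)
}

def submit_command(mode, app_jar, main_class):
    """Build a spark-submit command line for the given deployment mode."""
    return [
        "spark-submit",
        "--master", MASTER_URLS[mode],
        "--class", main_class,
        app_jar,
    ]

cmd = submit_command("standalone", "app.jar", "com.example.App")
print(" ".join(cmd))
# spark-submit --master spark://master:7077 --class com.example.App app.jar
```

The application code itself is unchanged across modes; only the master URL (and the cluster-side configuration) differs.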
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives

           | Hadoop Ecosystem | Spark Ecosystem
Components | HDFS             | Tachyon
           | YARN             | Mesos
Tools      | Pig              | Spark native API
           | Hive             | Spark SQL
           | Mahout           | MLlib
           | Storm            | Spark Streaming
           | Giraph           | GraphX
           | HUE              | Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos

Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
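To show the shape of that native API without a cluster, here is a toy, in-process stand-in for an RDD, written in plain Python. The class and its methods mimic PySpark's names but are not the real API; the word-count pipeline at the bottom, however, reads exactly as it would in PySpark.

```python
class FauxRDD:
    """A toy, in-process stand-in for an RDD, just to show the API shape.
    The real pyspark RDD is distributed and lazily evaluated; this is neither."""
    def __init__(self, data):
        self.data = list(data)

    def flatMap(self, f):
        return FauxRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return FauxRDD(f(x) for x in self.data)

    def reduceByKey(self, f):
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return FauxRDD(acc.items())

    def collect(self):
        return list(self.data)

# The classic word count, written exactly as it would look in PySpark:
lines = FauxRDD(["spark and hadoop", "spark or hadoop"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)
               .collect())
print(dict(counts))  # {'spark': 2, 'and': 1, 'hadoop': 2, 'or': 1}
```

In real Spark the same chain runs distributed and lazily; the point here is only how few lines the functional API needs compared to a hand-written MapReduce job.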
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
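The "mix and match SQL and imperative code" idea can be illustrated with the standard library's sqlite3 standing in for Spark SQL's engine; the Spark API itself differs, but the back-and-forth between a declarative query and ordinary program logic is the same pattern.

```python
import sqlite3

# Stand-in for a Hive/EDW table: load a few rows into an in-memory DB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ana", 3), ("bob", 7), ("ana", 2)])

# Declarative step: aggregate with SQL...
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# ...then continue imperatively on the result, as one would with an RDD.
top = [(user, total) for user, total in rows if total > 4]
print(top)  # [('ana', 5), ('bob', 7)]
```

In Spark SQL the query would run over a SchemaRDD/DataFrame and the follow-up step over a distributed collection, but the programming model is this alternation.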
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini-batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
95
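The first row of the table is the key architectural difference: Storm hands each record to your code as it arrives, while Spark Streaming discretizes the stream into mini-batches and processes each batch with the core Spark engine. A stdlib sketch of that discretization (batching by count rather than by wall-clock interval, to keep it testable):

```python
def mini_batches(stream, batch_size):
    """Discretize a stream into fixed-size mini-batches, the way Spark
    Streaming chops its input into one batch per time interval (here by
    record count instead of by wall-clock time, to keep the sketch simple)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Record-at-a-time (Storm-style) would invoke processing 7 times below;
# mini-batching (Spark Streaming-style) invokes it 3 times, once per batch.
records = ["e1", "e2", "e3", "e4", "e5", "e6", "e7"]
batches = list(mini_batches(records, 3))
print(len(batches))  # 3 batches; the last one is partial
```

Batching is what buys Spark Streaming its "Core Spark API" integration and exactly-once semantics, at the cost of the few-seconds latency shown in the table.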
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
1. Evolution
• 'Spark' for lightning-fast speed.
• This is how Apache Spark is branding itself: "Apache Spark™ is a fast and general engine for large-scale data processing." https://spark.apache.org
• Apache Spark is a general-purpose cluster computing framework; its execution model supports a wide variety of use cases: batch, interactive, near-real time.
• The rapid in-memory processing of resilient distributed datasets (RDDs) is the "core capability" of Apache Spark.
26
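The "resilient" in RDD is worth one concrete illustration: an RDD remembers how it was derived (its lineage), so a partition lost from memory is recomputed rather than restored from disk replicas. The toy class below is not Spark's actual implementation, just a minimal sketch of that idea.

```python
class LineageRDD:
    """Toy sketch of an RDD's core trick: remember how a dataset was
    derived (its lineage) so a lost in-memory copy can be recomputed."""
    def __init__(self, compute):
        self.compute = compute   # zero-argument function: the lineage
        self.cache = None        # the in-memory materialization

    def map(self, f):
        parent = self
        # The child records *how* to derive itself from the parent.
        return LineageRDD(lambda: [f(x) for x in parent.collect()])

    def collect(self):
        if self.cache is None:            # cache miss (or lost partition):
            self.cache = self.compute()   # recompute from lineage
        return self.cache

base = LineageRDD(lambda: [1, 2, 3])
squared = base.map(lambda x: x * x)
assert squared.collect() == [1, 4, 9]
squared.cache = None                     # simulate losing the in-memory copy
assert squared.collect() == [1, 4, 9]    # transparently recomputed
```

Real RDDs do this per partition across a cluster; the lineage graph is what lets Spark keep data in memory without the replication cost HDFS pays for fault tolerance.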
1. Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy".
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine."
• Apache Flink (http://flink.apache.org) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/27-flink
27
Hadoop MapReduce vs Tez vs Spark

Criteria            | MapReduce                                   | Tez                                 | Spark
License             | Open Source Apache 2.0, version 2.x         | Open Source Apache 2.0, version 0.x | Open Source Apache 2.0, version 1.x
Processing model    | On-disk (disk-based parallelization), Batch | On-disk, Batch, Interactive         | In-memory and on-disk; Batch, Interactive, Streaming (near real-time)
Language written in | Java                                        | Java                                | Scala
API                 | [Java, Python, Scala], user-facing          | Java [ISV/Engine/Tool builder]      | [Scala, Java, Python], user-facing
Libraries           | None (separate tools)                       | None                                | [Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs Tez vs Spark

Criteria         | MapReduce                                                                     | Tez                                                          | Spark
Installation     | Bound to Hadoop                                                               | Bound to Hadoop                                              | Isn't bound to Hadoop
Ease of use      | Difficult to program, needs abstractions; no interactive mode (except Hive, Pig) | Difficult to program; no interactive mode (except Hive, Pig) | Easy to program, no need of abstractions; interactive mode
Compatibility    | Compatibility to data types and data sources is the same                      | Same                                                         | Same
YARN integration | YARN application                                                              | Ground-up YARN application                                   | Spark is moving towards YARN
29
Hadoop MapReduce vs Tez vs Spark

Criteria    | MapReduce                  | Tez                            | Spark
Deployment  | YARN                       | YARN                           | [Standalone, YARN, SIMR, Mesos, …]
Performance | –                          | –                              | Good performance when data fits into memory; performance degradation otherwise
Security    | More features and projects | More features and projects (*) | Still in its infancy
30
(*) Partial support
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
32
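Point 1 above, reusing existing mapper and reducer functions, can be sketched in plain Python: the mapper/reducer pair below is written exactly as it would be for a MapReduce job, and the driver wires them into a Spark-style pipeline (in real Spark this would be `flatMap` over the mapper followed by a grouped application of the reducer; no Hadoop or Spark is needed for this sketch).

```python
from itertools import groupby

# Existing MapReduce-style functions, reused unchanged:
def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    yield (key, sum(values))

# Driver reusing them: flat-map the mapper, shuffle/sort by key,
# then apply the reducer to each key's group of values.
lines = ["spark with hadoop", "spark without hadoop"]
mapped = sorted(kv for line in lines for kv in mapper(line))  # the "shuffle"
result = [out
          for key, group in groupby(mapped, key=lambda kv: kv[0])
          for out in reducer(key, (v for _, v in group))]
print(result)  # [('hadoop', 2), ('spark', 2), ('with', 1), ('without', 1)]
```

The migration cost is concentrated in the driver; the domain logic in `mapper` and `reducer` carries over as-is.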
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostafa Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014 – Goodbye MapReduce: Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3. Integration
(Slide shows a matrix of open source tools from the Hadoop ecosystem that integrate with Spark, by service category:)
• Storage/Serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
bull Benchmark of Spark amp Cassandra Integration
using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume
data from Cassandra to spark and store Resilient
Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new
avenues
bull Kindling An Introduction to Spark with Cassandra
(Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-
spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files. https://github.com/mongodb/mongo-hadoop
48
3 Integration
bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from
Apache Spark (still experimental)
bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-
introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-
example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-
example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without
Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
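The schema inference described above can be sketched with the standard library: scan the JSON records and collect each field's observed types, which is roughly what Spark SQL does with a JSON dataset before exposing it as a SchemaRDD (this sketch is not Spark's actual algorithm, which also merges nested and conflicting types).

```python
import json

def infer_schema(json_lines):
    """Scan JSON records and collect each field's observed types,
    roughly what Spark SQL's JSON loader does before exposing a schema."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {f: sorted(types) for f, types in schema.items()}

data = [
    '{"name": "ana", "age": 34}',
    '{"name": "bob", "age": 28, "city": "LA"}',
]
print(infer_schema(data))
# {'name': ['str'], 'age': ['int'], 'city': ['str']}
```

Note that fields absent from some records (like "city" above) still enter the schema; Spark SQL likewise treats them as nullable columns.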
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
3 Integration
• Spark SQL Avro library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
bull The Kite SDK provides high level abstractions to
work with datasets on Hadoop hiding many of
the details of compression codecs file formats
partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016
release so Spark jobs can read and write to Kite
datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4. Complementarity: Tachyon + Spark + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity: Mesos + YARN
References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity: Spark + Tez
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB file system (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See 'Because Hadoop isn't perfect: 8 ways to replace HDFS', July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone: httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos: httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2: httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR: httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace: httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform: httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: httpstratiogithubiostreaming-cep-engine
• 'Stratio' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
82
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: httpswwwyoutubecomwatchv=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware
88
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs Mesos
Criteria          YARN                    Mesos
Resource sharing  Yes                     Yes
Written in        Java                    C++
Scheduling        Memory only             CPU and memory
Running tasks     Unix processes          Linux container groups
Requests          Specific requests and   More generic, but more coding
                  locality preference     for writing frameworks
Maturity          Less mature             Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
91
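Since the deck itself contains no code, the concise functional chaining that the native API (and Java 8 lambdas) enables can be previewed with plain Python built-ins. This is a conceptual, stdlib-only sketch: the names below mirror what `rdd.flatMap(...).filter(...)` expresses in Spark, but no Spark API is used.

```python
# Conceptual sketch of the functional chaining style that Spark's RDD API
# (flatMap, filter, reduce) makes concise. Plain Python stands in for RDDs.
from functools import reduce

lines = ["spark and hadoop", "spark without hadoop"]

# flatMap-like step: split every line into words
words = [w for line in lines for w in line.split()]

# filter-like step: keep only the records we care about
sparks = list(filter(lambda w: w == "spark", words))

# reduce-like step: count the surviving records
count = reduce(lambda acc, _: acc + 1, sparks, 0)
print(count)  # 2
```

The same pipeline reads almost identically in Scala, Python, or Java 8 lambdas, which is the point the slide makes about API conciseness.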
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
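The "mix and match SQL and imperative code" idea can be illustrated with Python's stdlib `sqlite3` standing in for the SQL engine. This is a conceptual sketch only, not the Spark SQL API (which operates on SchemaRDDs/DataFrames): a declarative SQL aggregation is followed by arbitrary imperative post-processing of the result.

```python
# Conceptual sketch of mixing declarative SQL with imperative code,
# the workflow Spark SQL enables. Stdlib sqlite3 stands in for the engine;
# this is NOT the Spark SQL API.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative step: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: arbitrary code over the query result
top = [user for user, total in rows if total > 4]
print(rows)  # [('ann', 5), ('bob', 7)]
print(top)   # ['ann', 'bob']
```

In Spark SQL the same two steps would be a SQL query over a registered table followed by ordinary RDD transformations on the result.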
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
'Spark Streaming' tag at: httpsparkbigdatacomcomponenttagstag3-spark-streaming
Storm vs Spark Streaming
Criteria                      Storm                  Spark Streaming
Processing model              Record at a time       Mini-batches
Latency                       Sub-second             Few seconds
Fault tolerance (every        At least once (may     Exactly once
record processed)             be duplicates)
Batch framework integration   Not available          Core Spark API
Supported languages           Any programming        Scala, Java, Python
                              language
95
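The processing-model difference in the table above can be sketched in a few lines of plain Python: a record-at-a-time engine hands each event to the handler as it arrives, while a mini-batch engine groups the stream into small batches first. This is a conceptual sketch only; neither Storm's nor Spark Streaming's API is used, and the batch size stands in for a batch interval.

```python
# Conceptual sketch of the two streaming models compared above.

def record_at_a_time(events, handler):
    """Storm-style: the handler sees one record per invocation."""
    for e in events:
        handler([e])          # one record at a time -> sub-second latency

def mini_batches(events, handler, batch_size):
    """Spark Streaming-style: the handler sees a small batch of records."""
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            handler(batch)    # a batch interval's worth of events
            batch = []
    if batch:
        handler(batch)        # flush the final partial batch

calls = []
mini_batches([1, 2, 3, 4, 5], calls.append, batch_size=2)
print(calls)  # [[1, 2], [3, 4], [5]]
```

The mini-batch model trades a few seconds of latency for the ability to reuse the core Spark batch API on each batch.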
GraphX
96
'GraphX' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
Notebook
97
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython: httpsgithubcomtribbloidISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
1 Evolution: Apache Flink
• Flink: German for "nimble, swift, speedy"
• This is how Apache Flink is branding itself: "Fast and reliable large-scale data processing engine"
• Apache Flink (httpflinkapacheorg) offers:
• Batch and streaming in the same system
• Beyond DAGs (cyclic operator graphs)
• Powerful, expressive APIs
• Inside-the-system iterations
• Full Hadoop compatibility
• Automatic, language-independent optimizer
• 'Flink' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag27-flink
27
Hadoop MapReduce vs Tez vs Spark
Criteria            MapReduce              Tez                    Spark
License             Open Source Apache     Open Source Apache     Open Source Apache
                    2.0, version 2.x       2.0, version 0.x       2.0, version 1.x
Processing model    On-disk (disk-based    On-disk, batch,        In-memory and on-disk,
                    parallelization),      interactive            batch, interactive,
                    batch                                         streaming (near real-time)
Language written in Java                   Java                   Scala
API                 [Java, Python,         Java [ISV/engine/      [Scala, Java, Python]
                    Scala] user-facing     tool builder]          user-facing
Libraries           None, separate tools   None                   [Spark Core, Spark Streaming,
                                                                  Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs Tez vs Spark
Criteria            MapReduce              Tez                    Spark
Installation        Bound to Hadoop        Bound to Hadoop        Isn't bound to Hadoop
Ease of use         Difficult to program,  Difficult to program;  Easy to program, no need
                    needs abstractions;    no interactive mode    of abstractions;
                    no interactive mode    except Hive, Pig       interactive mode
                    except Hive, Pig
Compatibility       Same compatibility to data types and data sources for all three
YARN integration    YARN application       Ground-up YARN         Spark is moving
                                           application            towards YARN
29
Hadoop MapReduce vs Tez vs Spark
Criteria            MapReduce              Tez                    Spark
Deployment          YARN                   YARN                   [Standalone, YARN,
                                                                  SIMR, Mesos, …]
Performance                                                       - Good performance when
                                                                    data fits into memory
                                                                  - Performance degradation
                                                                    otherwise
Security            More features and      More features and      Still in its infancy
                    projects               projects
Legend: partial support
30
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
32
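Point 1 above, reusing mapper and reducer functions, can be sketched with plain Python: the mapper and reducer are ordinary functions, so the same code a MapReduce job calls could be handed to Spark (e.g. `rdd.flatMap(mapper)` and `reduceByKey(reducer)`). This is a stdlib-only conceptual sketch; the Spark calls appear only in comments.

```python
# Conceptual sketch: a word-count mapper and reducer written as plain
# functions. In a migration, the same functions could be passed to Spark,
# e.g. rdd.flatMap(mapper).reduceByKey(reducer). Here stdlib tools drive
# them instead of a cluster.
from functools import reduce
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (word, 1) pairs -- reusable in MapReduce or Spark."""
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    """Combine two partial counts -- reusable in MapReduce or Spark."""
    return a + b

lines = ["spark with hadoop", "spark without hadoop"]
pairs = sorted(p for line in lines for p in mapper(line))

# local stand-in for the shuffle + reduce phase
counts = {key: reduce(reducer, (v for _, v in group))
          for key, group in groupby(pairs, key=itemgetter(0))}
print(counts)  # {'hadoop': 2, 'spark': 2, 'with': 1, 'without': 1}
```

Because `mapper` and `reducer` carry no framework dependencies, they migrate between execution engines unchanged.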
2 Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open), Q1 2015: httpsissuesapacheorgjirabrowseHIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs: httpsissuesapacheorgjirabrowseSQOOP-1532
37
Cascading (expected in 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout interactive shell: an interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark shell: httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
Service categories (with their open source tools) integrated with Spark:
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without using the Hadoop API anymore: Spark-HBase Connector: httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: httptuplejumpgithubiocalliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: httpsgithubcommongodbmongo-hadoop
• MongoDB-Spark demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
• Part 3: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving, and some issues are critical ones: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0, SPARK-2883: httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml
55
3 Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
56
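The schema-inference idea can be illustrated with the stdlib `json` module: walk the records and record the set of types observed for each field. This is a conceptual sketch only; Spark SQL's actual inference also merges nested structures, widens conflicting types, and handles nulls.

```python
# Conceptual sketch of JSON schema inference, the idea behind Spark SQL's
# "point it at JSON files, no DDL" workflow. Stdlib only; this is not the
# Spark SQL implementation.
import json

def infer_schema(json_lines):
    """Map each field name to the set of Python type names observed for it."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema

lines = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 27, "city": "LA"}',
]
schema = infer_schema(lines)
print(schema)  # {'name': {'str'}, 'age': {'int'}, 'city': {'str'}}
```

Note that `city` is inferred even though it appears in only one record, which is why engines like Spark SQL treat inferred fields as nullable.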
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
57
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: httpsgithubcomdatabricksspark-avro
• This is an example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: httpsgithubcomkite-sdkkite-examplestreemasterspark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines can:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: httpvimeocom83192197
• Big Data Web applications for Interactive Hadoop: httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
64
Hadoop ecosystem Spark ecosystem
4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag41
66
4 Complementarity
References:
• Apache Mesos vs Apache Hadoop YARN: httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: httpsgithubcommesosmyriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? httpstrataconfcombig-data-conference-ca-2015publicscheduledetail40620
67
4 Complementarity
• Spark on Tez for efficient ETL: httpsgithubcomhortonworksspark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: httphortonworkscomblogimproving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
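The data-vs-RAM rule of thumb above amounts to a simple dispatch decision, sketched below in plain Python. The threshold and engine names are illustrative assumptions only, not part of any product or framework.

```python
# Conceptual sketch of the rule of thumb above: pick an in-memory engine
# when the working set fits in cluster RAM, and a disk/stream-oriented one
# otherwise. The threshold and engine names are illustrative only.

def choose_engine(data_size_gb, cluster_ram_gb):
    """Return the engine suggested by the Data >> RAM / Data << RAM rule."""
    if data_size_gb < cluster_ram_gb:
        return "spark"   # can cache parsed data in memory
    return "tez"         # more stream-oriented, mature shuffling

print(choose_engine(100, 512))   # spark
print(choose_engine(5000, 512))  # tez
```

A "smart execution engine" generalizes this idea: it also weighs platform type, data attributes, and current cluster condition, as the next slide describes.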
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: httpwwwinfoqcomarticlesdatameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-right-execution-enginehtml
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group
• httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-2015pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015: httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: httpsparkbigdatacomcomponenttagstag13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS: httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
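Whichever store is chosen, Spark addresses it through a URI scheme on the input path. A hedged pseudocode sketch of common Spark 1.x path forms (host, bucket and container names are placeholders, not from this deck):

```text
sc.textFile("hdfs://namenode:8020/data/events.log")   // HDFS (optional, not required)
sc.textFile("file:///var/log/events.log")             // local file system
sc.textFile("s3n://my-bucket/events.log")             // Amazon S3 (Spark 1.x scheme)
sc.textFile("swift://logs.provider/events.log")       // OpenStack Swift object store
```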
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite this discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: httpswwwquantcastcomengineeringqfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
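In practice, the deployment choice largely comes down to the master URL handed to Spark. A hedged config-fragment sketch of the Spark 1.x master URL forms for options 1-3 plus YARN (host names are placeholders):

```text
spark-submit --master local[*]           ... option 1: local, all cores of one machine
spark-submit --master spark://host:7077  ... option 2: standalone cluster
spark-submit --master mesos://host:5050  ... option 3: Apache Mesos
spark-submit --master yarn-cluster       ... YARN, for Hadoop deployments
```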
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. httpsdatabrickscomproductdatabricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instanthtml
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enterprisesparksparkTOChtml
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag40
82
83
• xPatterns (httpatigeocomtechnology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag39
84
• The BlueData (httpwwwbluedatacom) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
• Guavus (httpwwwguavuscom) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: httpinsidebigdatacom20140926guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
              Hadoop Ecosystem   Spark Ecosystem
  Component   HDFS               Tachyon
              YARN               Mesos
  Tools       Pig                Spark native API
              Hive               Spark SQL
              Mahout             MLlib
              Storm              Spark Streaming
              Giraph             GraphX
              HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. httptachyon-projectorg
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): httpsamplabcsberkeleyedusoftware
88
• Mesos (httpmesosapacheorg) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps. Provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs. Mesos
  Criteria          YARN                           Mesos
  Resource sharing  Yes                            Yes
  Written in        Java                           C++
  Scheduling        Memory only                    CPU and Memory
  Running tasks     Unix processes                 Linux Container groups
  Requests          Specific requests and          More generic, but more coding
                    locality preference            for writing frameworks
  Maturity          Less mature                    Relatively more mature
90
Spark Native API
• Spark Native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag3-spark-streaming
Storm vs. Spark Streaming
  Criteria              Storm                 Spark Streaming
  Processing model      Record at a time      Mini batches
  Latency               Sub-second            Few seconds
  Fault tolerance       At least once         Exactly once
  (every record         (may be duplicates)
  processed)
  Batch framework       Not available         Core Spark API
  integration
  Supported languages   Any programming       Scala, Java, Python
                        language
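The "mini batches" row above is the crux of the difference, and it can be illustrated without Spark at all. A toy pure-Python sketch (my own helper names, not Spark's API) that discretizes a stream of records into fixed-size batches and processes each batch as one unit:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Discretize an iterator of records into small batches, the way
    Spark Streaming turns a live stream into a sequence of RDDs."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Storm-style processing would handle each record alone; here each
# batch is processed as a unit, which amortizes per-record overhead
# but adds latency of up to one batch interval.
batch_sizes = [len(b) for b in micro_batches(range(10), 3)]
print(batch_sizes)  # [3, 3, 3, 1]
```

This is why Spark Streaming trades sub-second latency for higher throughput and exactly-once semantics per batch.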
95
GraphX
96
'GraphX' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag6-graphx
Notebook
97
• Zeppelin (httpzeppelin-projectorg) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for IPython: httpsgithubcomtribbloidISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
httpwwwSparkBigDatacom
sbaltagi@gmail.com
httpswwwlinkedincominslimbaltagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
Hadoop MapReduce vs. Tez vs. Spark
  Criteria     MapReduce              Tez                  Spark
  License      Open Source            Open Source          Open Source
               Apache 2.0,            Apache 2.0,          Apache 2.0,
               version 2.x            version 0.x          version 1.x
  Processing   On-Disk (disk-based    On-Disk,             In-Memory, On-Disk,
  model        parallelization),      Batch,               Batch, Interactive,
               Batch                  Interactive          Streaming (Near Real-Time)
  Language     Java                   Java                 Scala
  written in
  API          [Java, Python,         Java [ISV/Engine/    [Scala, Java, Python]
               Scala] User-Facing     Tool builder]        User-Facing
  Libraries    None, separate tools   None                 [Spark Core, Spark Streaming,
                                                           Spark SQL, MLlib, GraphX]
28
Hadoop MapReduce vs. Tez vs. Spark
  Criteria        MapReduce              Tez                  Spark
  Installation    Bound to Hadoop        Bound to Hadoop      Isn't bound to Hadoop
  Ease of use     Difficult to program,  Difficult to         Easy to program,
                  needs abstractions,    program, no          no need of abstractions,
                  no interactive mode    interactive mode     interactive mode
                  except Hive, Pig       except Hive, Pig
  Compatibility   Compatibility to data types and data sources is the same for all three
  YARN            YARN application       Ground-up YARN       Spark is moving
  integration                            application          towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark
  Criteria     MapReduce             Tez                   Spark
  Deployment   YARN                  YARN                  [Standalone, YARN,
                                                           SIMR, Mesos, …]
  Performance                                             - Good performance when
                                                            data fits into memory
                                                          - Performance degradation
                                                            otherwise
  Security     More features and     More features and     Still in its infancy
               projects              projects
30
Partial support
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark: How-to: Translate from MapReduce to Apache Spark: httpblogclouderacomblog201409how-to-translate-from-mapreduce-to-apache-spark
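Point 1 can be sketched in miniature. A hypothetical pure-Python word count (illustrative only; in a real migration you would call your Java/Scala mapper and reducer from Spark's flatMap and reduceByKey): the mapper and reducer are written once, MapReduce-style, then reused unchanged in a locally simulated Spark-like pipeline.

```python
from collections import defaultdict
from functools import reduce

# Mapper and reducer as they might exist in a Hadoop MapReduce job:
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(a, b):
    return a + b

def flat_map_reduce_by_key(lines, map_fn, reduce_fn):
    """Local stand-in for rdd.flatMap(map_fn).reduceByKey(reduce_fn)."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):      # flatMap
            groups[key].append(value)
    return {k: reduce(reduce_fn, vs) for k, vs in groups.items()}  # reduceByKey

counts = flat_map_reduce_by_key(["to be or not", "to be"], mapper, reducer)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The point is that the business logic (mapper, reducer) survives the migration; only the driver code around it changes.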
32
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): httpsissuesapacheorgjirabrowsePIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag19
34
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
  hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark Umbrella Jira (Status: Open, Q1 2015): httpsissuesapacheorgjirabrowseHIVE-7292
35
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• Design: httpblogclouderacomblog201407apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: httpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: httpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag12
36
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: httpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: httpsissuesapacheorgjirabrowseSQOOP-1532
37
(Expected in Cascading 3.1 release)
• Cascading (httpwwwcascadingorg) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: httpwwwcascadingorgnew-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: httpscaldingio201410running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: httpscrunchapacheorg
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSparkPipelinehtml
• Running Crunch with Spark: httpwwwclouderacomcontentclouderaendocumentationcorev5-2-xtopicscdh_ig_running_crunch_with_sparkhtml
39
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: httpmahoutapacheorg
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: httpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014: httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased) - MapReduce, Spark, H2O, Flink: httpmahoutapacheorgusersbasicsalgorithmshtml
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
Service categories and open source tools (the tools appear as logos on the slide):
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory, httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: httpsissuesapacheorgjirabrowseHDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTestscala from the Spark code: httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapachesparkexamplesHBaseTestscala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: httpsgithubcomnerdammerspark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support: httpblogclouderacomblog201412new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: httpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: httpstratiogithubiodeep-spark
• Getting Started with Apache Spark and Cassandra: httpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: httpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: httptuplejumpgithubiocalliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): httpplanetcassandraorgblogkindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.
• MongoDB-Spark Demo: httpsgithubcomcrcsmnkymongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: httpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: httpsgithubcommongodbmongo-hadoop
48
3 Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: httpsgithubcomspiromspark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• PART 1: httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-introduction-setup
• PART 2: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-example
• PART 3: httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: httptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
50
3 Integration YARN
• YARN (Yet Another Resource Negotiator) is an implicit reference to Mesos as the resource negotiator.
• Integration is still improving, and some issues are critical ones: httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND20summary20~20yarn20AND20status203D20OPEN20ORDER20BY20priority20DESC0A
• Running Spark on YARN: httpsparkapacheorgdocslatestrunning-on-yarnhtml
• Get the most out of Spark on YARN: httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0, SPARK-2883: httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: httpdrillapacheorg
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: httpdrillapacheorgblog20141216whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: httpkafkaapacheorg
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: httpsparkapacheorgdocslateststreaming-kafka-integrationhtml
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: httpsparkbigdatacomcomponenttagstag24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem: httpflumeapacheorg
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach.
• Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: httpssparkapacheorgdocslateststreaming-flume-integrationhtml
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: httpdatabrickscomblog20150202an-introduction-to-json-support-in-spark-sqlhtml
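The kind of schema inference described above can be sketched in plain Python (a toy illustration only, not Spark SQL's actual algorithm): scan JSON records and derive a field-to-types mapping, collecting every type observed per field.

```python
import json

def infer_schema(json_lines):
    """Toy schema inference over newline-delimited JSON records:
    collect the set of Python type names seen for each field."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {f: sorted(types) for f, types in schema.items()}

records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]
print(infer_schema(records))
# {'name': ['str'], 'age': ['int'], 'city': ['str']}
```

Spark SQL performs the equivalent pass over the distributed dataset, widening conflicting types to a common supertype, which is what lets it skip DDL entirely.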
56
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files: httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrating example of the integration of Parquet and Spark SQL: httpwwwinfoobjectscomspark-sql-parquet
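The columnar idea behind Parquet can be shown in a few lines of plain Python (a conceptual sketch only, not the Parquet format itself): rows are pivoted into per-column arrays, which is what makes column pruning and per-column compression effective.

```python
def to_columnar(rows):
    """Pivot a list of row dicts into a dict of per-column value lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

rows = [{"id": 1, "city": "LA"}, {"id": 2, "city": "SF"}]
cols = to_columnar(rows)
# A query touching only "id" now reads one contiguous list and
# can skip "city" entirely, which is the point of a columnar layout.
print(cols["id"], cols["city"])  # [1, 2] ['LA', 'SF']
```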
57
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: httpsgithubcomdatabricksspark-avro
• An example of using Avro and Parquet in Spark SQL: httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case: httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: httpkitesdkorgdocscurrent
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: httpsgithubcomkite-sdkkite-examplestreemasterspark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch was added in 2.1: httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
• Deep-Spark also provides an integration with Spark: httpsgithubcomStratiodeep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: httpsgithubcomelasticelasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: httpwwwintellilinkcojparticlecolumnbigdata-kk02html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
• Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: httpwwwslidesharenetwhoschekingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: httpwwwgethuecom
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: httpvimeocom83192197
• Big Data Web applications for Interactive Hadoop: httpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics or… HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
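The Data << RAM case above can be felt even outside Spark: once the parsed form of a dataset fits in memory, every additional pass over it is cheap, which is exactly what Spark's RDD caching exploits. A minimal plain-Python sketch, not Spark code (the sample lines and parse function are invented for illustration):

```python
# Sketch: why caching parsed data helps when the data fits in memory.
# Parsing happens once; later passes reuse the in-memory form, mirroring
# what rdd.cache() buys in Spark (illustrative only, not Spark API).

raw_lines = ["1,apple", "2,banana", "3,cherry"] * 1000  # fits in RAM

parse_calls = 0

def parse(line):
    global parse_calls
    parse_calls += 1
    key, value = line.split(",")
    return int(key), value

# Without caching: every pass re-parses all lines.
def total_without_cache(passes):
    total = 0
    for _ in range(passes):
        total += sum(k for k, _ in (parse(l) for l in raw_lines))
    return total

# With caching: parse once, then iterate over the cached records.
def total_with_cache(passes):
    cached = [parse(l) for l in raw_lines]  # like rdd.cache()
    return sum(sum(k for k, _ in cached) for _ in range(passes))

parse_calls = 0
a = total_without_cache(3)
uncached_calls = parse_calls

parse_calls = 0
b = total_with_cache(3)
cached_calls = parse_calls
```

Both totals agree, but the uncached version re-parses every line on every pass while the cached version parses each line exactly once.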
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine, an interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer: http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015 at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5 Key Takeaways
1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 Use OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
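The storage options above all feed the same computation: a Spark job is written against an RDD, not against a particular file system. A hedged plain-Python sketch of that idea (the word-count logic, sample data and file name are invented; real Spark would use sc.textFile or sc.parallelize):

```python
import io, os, tempfile

# The transformation is storage-agnostic: it consumes any iterable of lines,
# just as a Spark job consumes an RDD regardless of the backing file system.
def word_count(lines):
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

data = ["spark and hadoop", "spark without hadoop"]

# Source 1: an in-memory collection (like sc.parallelize).
from_memory = word_count(data)

# Source 2: a local file (like sc.textFile("file:///...")).
path = os.path.join(tempfile.mkdtemp(), "input.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"\n".join(data))
with io.open(path, encoding="utf-8") as f:
    from_file = word_count(f)

# Same logic, same result, different storage.
```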
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
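The deployment modes listed above differ mainly in the master URL handed to Spark: local[*], spark://host:7077 (standalone), mesos://host:5050 and yarn-client / yarn-cluster are the documented forms. The classifier below is only an illustrative sketch, not Spark internals:

```python
# Classify a Spark master URL by cluster manager. The URL forms are the
# documented ones; the dispatch function itself is hypothetical.
def cluster_manager(master):
    if master.startswith("local"):          # local[N] or local[*]
        return "local"
    if master.startswith("spark://"):       # standalone cluster manager
        return "standalone"
    if master.startswith("mesos://"):       # Apache Mesos
        return "mesos"
    if master in ("yarn", "yarn-client", "yarn-cluster"):
        return "yarn"                       # Hadoop YARN
    raise ValueError("unknown master URL: %s" % master)
```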
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
             Hadoop ecosystem    Spark ecosystem
Components   HDFS                Tachyon
             YARN                Mesos
Tools        Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
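Fine-grained sharing works through Mesos resource offers: as CPUs free up they are offered to frameworks, and a Spark job can expand into whatever is currently idle. A toy offer loop (offer sizes and the task list are invented; the real Mesos API is callback-based):

```python
# Toy resource-offer loop: each offer advertises some free CPUs, and the
# "Spark framework" launches as many pending 1-CPU tasks as the offer covers.
offers = [2, 1, 3]                       # free CPUs in successive offers
tasks = ["t%d" % i for i in range(5)]    # 5 pending 1-CPU tasks

def schedule(offers, tasks):
    pending = list(tasks)
    placements = []
    for cpus in offers:
        launched, pending = pending[:cpus], pending[cpus:]
        placements.append(launched)
        if not pending:
            break
    return placements, pending

placements, leftover = schedule(offers, tasks)
```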
YARN vs. Mesos
Criteria          YARN                      Mesos
Resource sharing  Yes                       Yes
Written in        Java                      C++
Scheduling        Memory only               CPU and Memory
Running tasks     Unix processes            Linux Container groups
Requests          Specific requests and     More generic, but more coding
                  locality preference       for writing frameworks
Maturity          Less mature               Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 for much more concise lambda expressions, to get code nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
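The conciseness point (Scala API, Java 8 lambdas) can be illustrated in plain Python: the same "count the ERROR lines" job reads as one chained functional expression instead of an explicit loop. Both functions are invented examples, not Spark code:

```python
from functools import reduce

log = ["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"]

# Verbose, loop-based style (what pre-Java-8 MapReduce code feels like).
def count_errors_loop(lines):
    n = 0
    for line in lines:
        if line.startswith("ERROR"):
            n += 1
    return n

# Lambda/pipeline style, close in spirit to
# lines.filter(_.startsWith("ERROR")).count() in the Scala API.
count_errors_pipeline = lambda lines: reduce(
    lambda acc, _: acc + 1,
    filter(lambda l: l.startswith("ERROR"), lines),
    0,
)
```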
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
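The "mix and match SQL and imperative APIs" idea can be mimicked with the stdlib sqlite3 module standing in for Spark SQL (the table, rows and threshold are invented; in Spark the same pattern runs through sqlContext.sql() followed by RDD or DataFrame operations):

```python
import sqlite3

# SQL side: declarative aggregation, as with sqlContext.sql(...).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 10.0), ("bob", 5.0), ("ann", 2.5)])
rows = conn.execute(
    "SELECT user, SUM(amount) AS total FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative side: arbitrary post-processing of the query result,
# as one would continue with map/filter on a DataFrame.
big_spenders = [user for user, total in rows if total > 6.0]
```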
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                  Storm                  Spark Streaming
Processing model          Record at a time       Mini-batches
Latency                   Sub-second             Few seconds
Fault tolerance           At least once          Exactly once
(every record processed)  (may be duplicates)
Batch framework           Not available          Core Spark API
integration
Supported languages       Any programming        Scala, Java,
                          language               Python
95
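The fault-tolerance row deserves a concrete reading: a record-at-a-time system that replays unacknowledged records after a crash can process a record twice (at-least-once), while recomputing a whole deterministic mini-batch from its source can yield exactly-once results. A toy simulation (records and crash point are invented):

```python
# Toy failure model: the worker crashes after processing record index 2
# but before acknowledging it, so that record is delivered again.

records = ["a", "b", "c", "d"]

def at_least_once(records, crash_after):
    counts = {}
    acked = 0
    # First attempt: process records, crash before acking `crash_after`.
    for i, r in enumerate(records):
        counts[r] = counts.get(r, 0) + 1
        if i == crash_after:
            break                 # crash: record i processed but not acked
        acked = i + 1
    # Replay resumes from the last acked record -> duplicate processing.
    for r in records[acked:]:
        counts[r] = counts.get(r, 0) + 1
    return counts

def exactly_once_batch(records):
    # Mini-batch model: the whole batch is recomputed from its source on
    # failure, and the batch's output replaces (not adds to) prior output.
    def run_batch():
        counts = {}
        for r in records:
            counts[r] = counts.get(r, 0) + 1
        return counts
    run_batch()                   # first attempt; assume its output is lost
    return run_batch()            # replay yields identical output

dup = at_least_once(records, crash_after=2)
exact = exactly_once_batch(records)
```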
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic: Bring Your Own Storage
2 Deployment: Spark is cluster-infrastructure agnostic: choose your deployment
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
Hadoop MapReduce vs. Tez vs. Spark
Criteria          MapReduce              Tez                    Spark
Installation      Bound to Hadoop        Bound to Hadoop        Isn't bound to Hadoop
Ease of use       Difficult to program,  Difficult to program;  Easy to program,
                  needs abstractions;    no interactive mode    no need of abstractions;
                  no interactive mode    except Hive, Pig       interactive mode
                  except Hive, Pig
Compatibility     to data types and data sources is the same for all three
YARN integration  YARN application       Ground-up YARN         Spark is moving
                                         application            towards YARN
29
Hadoop MapReduce vs. Tez vs. Spark
Criteria     MapReduce           Tez                 Spark
Deployment   YARN                YARN                [Standalone, YARN, SIMR, Mesos, …]
Performance                                          Good performance when data fits into
                                                     memory; performance degradation otherwise
Security     More features and   More features and   Still in its infancy
             projects            projects
30
Partial support
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
32
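Point 1 above can be sketched: a classic MapReduce-style word-count mapper and reducer slot unchanged into a flatMap / groupByKey-shaped pipeline. Plain Python stands in for the Spark calls; the function names are illustrative:

```python
from itertools import groupby

# Existing MapReduce-style functions, reused unchanged.
def mapper(line):                      # emits (word, 1) pairs
    return [(w, 1) for w in line.split()]

def reducer(word, ones):               # sums the counts for one key
    return (word, sum(ones))

def spark_like_pipeline(lines):
    # flatMap(mapper)
    pairs = [kv for line in lines for kv in mapper(line)]
    # groupByKey + reducer, mimicking rdd.groupByKey().map(reducer)
    pairs.sort(key=lambda kv: kv[0])
    return dict(reducer(k, (v for _, v in grp))
                for k, grp in groupby(pairs, key=lambda kv: kv[0]))

wc = spark_like_pipeline(["to be or not to be"])
```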
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles/
• Getting started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka 'from SQL to Hadoop') was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop 2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop 2: Support Sqoop on the Spark execution engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
(Expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support/
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark/
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout news, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
Services and open source tools (the tools appear as logos in the original slide):
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm/), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving, and some issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
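Spark Streaming's Kafka integration effectively turns a continuous topic into small batches, one per batch interval. The batching step can be sketched with a stdlib deque standing in for a Kafka topic partition (topic contents and batch size are invented):

```python
from collections import deque

# A deque stands in for a Kafka topic partition; drain_batch mimics one
# Spark Streaming batch interval pulling whatever has arrived so far.
topic = deque(["msg-%d" % i for i in range(7)])

def drain_batch(topic, max_records):
    batch = []
    while topic and len(batch) < max_records:
        batch.append(topic.popleft())
    return batch

batches = []
while topic:
    batches.append(drain_batch(topic, max_records=3))
```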
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
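Schema inference as described above can be sketched with the stdlib json module: scan the records once and record each field's observed type. This is only a rough analogy, as Spark SQL's real inference also merges and widens conflicting types; the sample records are invented:

```python
import json

lines = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]

def infer_schema(json_lines):
    # Union of fields across records, each mapped to its observed type name.
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema[field] = type(value).__name__
    return schema

schema = infer_schema(lines)
```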
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
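Parquet's columnar idea in miniature: rows are transposed into per-column arrays, so a query touching one column reads only that column. A tiny sketch (the data is invented; real Parquet adds row groups, encodings and compression):

```python
rows = [("ann", 34, "LA"), ("bob", 29, "NY"), ("cid", 41, "SF")]
columns = ["name", "age", "city"]

# Row-to-columnar transpose: one contiguous list per column.
columnar = {col: [row[i] for row in rows] for i, col in enumerate(columns)}

# Column pruning: an aggregate over "age" touches a single column,
# instead of deserializing every full row.
avg_age = sum(columnar["age"]) / len(columnar["age"])
```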
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good on fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4 Complementarity +
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1 Evolution: The evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
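The file-system agnosticism above comes down to URI schemes: Spark dispatches to a storage backend based on the scheme prefix of the path it is given. A stdlib-only sketch of that dispatch key (the schemes are real ones from the list above, but no Spark APIs are involved, and the host names are placeholders):

```python
# Spark chooses a storage backend from the URI scheme of the input path,
# which is why the same job can read HDFS, S3, local files, or Tachyon.
# This sketch only extracts the dispatch key; no Spark APIs are used.
from urllib.parse import urlparse

paths = [
    "hdfs://namenode:8020/logs",      # Hadoop HDFS
    "s3n://my-bucket/logs",           # Amazon S3
    "file:///tmp/logs",               # local file system
    "tachyon://master:19998/logs",    # Tachyon in-memory FS
]
schemes = [urlparse(p).scheme for p in paths]
print(schemes)  # ['hdfs', 's3n', 'file', 'tachyon']
```

Swapping storage systems is then largely a matter of changing the path prefix, provided the matching connector is on the classpath.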
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
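At submission time, these deployment modes differ mostly in the `--master` URL passed to `spark-submit`. A hedged cheat sheet (host names, ports, and `app.py` are placeholders; the cloud options such as EC2, EMR, Rackspace, and Google Cloud Platform ultimately launch one of these cluster managers under the hood):

```shell
# Placeholder hosts/ports; app.py stands for any Spark application.
spark-submit --master local[4]            app.py   # 1. Local, 4 cores
spark-submit --master spark://host:7077   app.py   # 2. Standalone cluster
spark-submit --master mesos://host:5050   app.py   # 3. Apache Mesos
spark-submit --master yarn-cluster        app.py   # YARN (Hadoop clusters)
```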
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Criteria | Hadoop Ecosystem | Spark Ecosystem
Component | HDFS | Tachyon
Component | YARN | Mesos
Tool | Pig | Spark native API
Tool | Hive | Spark SQL
Tool | Mahout | MLlib
Tool | Storm | Spark Streaming
Tool | Giraph | GraphX
Tool | HUE | Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
95
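The first row of the table is the key conceptual difference, and it can be sketched without either framework: a record-at-a-time system invokes processing once per event as it arrives, while Spark Streaming first groups events into small fixed-interval batches. A toy pure-Python illustration (no Storm or Spark APIs; a list stands in for the event stream):

```python
# Toy contrast of the two processing models: record-at-a-time vs mini-batches.
stream = [1, 2, 3, 4, 5, 6, 7]

# Record-at-a-time (Storm-like): one processing call per record
record_results = [x * x for x in stream]

# Mini-batches (Spark Streaming-like): group records into small batches,
# then process each batch as a unit (here: sum of squares per batch)
batch_size = 3
batches = [stream[i:i + batch_size] for i in range(0, len(stream), batch_size)]
batch_results = [sum(x * x for x in b) for b in batches]

assert record_results == [1, 4, 9, 16, 25, 36, 49]
assert batches == [[1, 2, 3], [4, 5, 6], [7]]
assert batch_results == [14, 77, 49]
```

Batching is what gives Spark Streaming its few-seconds latency floor, but also its exactly-once semantics and reuse of the core batch API.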
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
Hadoop MapReduce vs Tez vs Spark
Criteria | MapReduce | Tez | Spark
Deployment | YARN | YARN | [Standalone, YARN, SIMR, Mesos, …]
Performance | | | Good performance when data fits into memory; performance degradation otherwise
Security | More features and projects | More features and projects | Still in its infancy; partial support
30
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2 Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1 You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2 You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
32
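The mapper/reducer reuse point above can be sketched without Hadoop or Spark installed. Below, plain Python lists stand in for input splits and RDDs: the same `mapper` and `reducer` functions serve both a staged map/shuffle/reduce driver and a Spark-style chain of transformations (the real Spark call would be `sc.textFile(...).flatMap(mapper).reduceByKey(lambda a, b: a + b)`).

```python
# Pure-Python emulation of translating MapReduce word count to Spark style.
# No Hadoop/Spark here; lists stand in for input splits / RDDs.
from itertools import groupby

def mapper(line):                      # the function a Hadoop Mapper would wrap
    return [(w, 1) for w in line.split()]

def reducer(word, counts):             # the function a Hadoop Reducer would wrap
    return (word, sum(counts))

lines = ["spark and hadoop", "spark or hadoop"]

# MapReduce style: explicit map -> shuffle (sort + group by key) -> reduce
mapped = [kv for line in lines for kv in mapper(line)]
shuffled = groupby(sorted(mapped), key=lambda kv: kv[0])
mr_result = dict(reducer(k, [v for _, v in grp]) for k, grp in shuffled)

# Spark style: the same mapper output folded by key, as reduceByKey would do
spark_style = {}
for k, v in mapped:
    spark_style[k] = spark_style.get(k, 0) + v

assert mr_result == spark_style == {"spark": 2, "and": 1, "hadoop": 2, "or": 1}
```

The point of the Cloudera how-to above is exactly this: the per-record logic carries over unchanged, and only the driver plumbing shrinks.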
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (Expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
Service categories with open source tools that integrate with Spark:
• Storage/Serving layer
• Data formats
• Data ingestion services
• Resource management
• Search
• SQL
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore. Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files
48
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
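The idea behind that schema inference can be sketched with the standard library alone: scan the JSON records, union the field names seen, and note each field's type. This is only a toy (Spark SQL does the real thing with a richer type system and distributed scanning), and the sample records are invented for illustration:

```python
# Stdlib-only sketch of JSON schema inference: union fields across records
# and record each field's observed Python type names. Toy data, not Spark.
import json

records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',   # note: different fields per record
]

schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema.setdefault(field, set()).add(type(value).__name__)

assert schema == {"name": {"str"}, "age": {"int"}, "city": {"str"}}
```

Because the schema is derived from the data itself, no DDL has to be written before querying, which is exactly the convenience the slide describes.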
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
bull Apache Solr added a Spark-based indexing tool for
fast and easy indexing ingestion and serving
searchable complex data ldquoCrunchIndexerTool on
Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4. Complementarity: Mesos + YARN
• Mesos and YARN can work together, each doing what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity: Mesos + YARN
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group:
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
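The "file-system agnostic" point above can be sketched in plain Python: a storage-agnostic engine resolves an input URI's scheme and delegates to whichever backend handles it, much as Spark delegates to the file systems exposed through the Hadoop API (hdfs://, s3n://, file://, …). Everything below is an illustrative toy, not Spark code; all names are made up.

```python
# Toy sketch (not Spark): dispatch reads by URI scheme, the way a
# storage-agnostic engine picks a backend per input path.
from urllib.parse import urlparse

READERS = {}

def register_scheme(scheme):
    """Register a reader function for one URI scheme."""
    def wrap(fn):
        READERS[scheme] = fn
        return fn
    return wrap

MEM_STORE = {"/events": "a,b,c"}  # stand-in for an in-memory store like Tachyon

@register_scheme("mem")
def read_mem(path):
    # Serve the path from the in-memory backend.
    return MEM_STORE[path]

@register_scheme("file")
def read_file(path):
    # Serve the path from the local file system.
    with open(path) as f:
        return f.read()

def text_file(uri):
    """Engine entry point: choose a backend from the URI scheme."""
    parsed = urlparse(uri)
    return READERS[parsed.scheme](parsed.path)

print(text_file("mem:///events"))  # served from the in-memory backend
```

The calling code never changes when the storage backend does; only the URI does, which is the property the slide is pointing at.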
1. File System
Coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recordit.blog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives
             Hadoop ecosystem   Spark ecosystem
Components:  HDFS               Tachyon
             YARN               Mesos
Tools:       Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria          YARN                   Mesos
Resource sharing  Yes                    Yes
Written in        Java                   C++
Scheduling        Memory only            CPU and memory
Running tasks     Unix processes         Linux container groups
Requests          Specific requests and  More generic, but more coding
                  locality preference    for writing frameworks
Maturity          Less mature            Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria               Storm                 Spark Streaming
Processing model       Record at a time      Mini-batches
Latency                Sub-second            Few seconds
Fault tolerance:       At least once         Exactly once
every record           (may be duplicates)
processed
Batch framework        Not available         Core Spark API
integration
Supported languages    Any programming       Scala, Java, Python
                       language
95
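The first two rows of the table come down to one structural difference: a record-at-a-time engine invokes the processing function once per record, while a micro-batching engine groups the stream into small batches and invokes it once per batch, which is why Spark Streaming's latency is "a few seconds" rather than sub-second. A pure-Python contrast of the two loops (a toy, with neither Storm nor Spark involved):

```python
# Toy contrast: record-at-a-time vs. micro-batch stream processing.
from itertools import islice

def record_at_a_time(stream, handle):
    # One handler invocation per record (Storm-style).
    for record in stream:
        handle([record])

def micro_batches(stream, handle, batch_size=4):
    # One handler invocation per mini-batch (Spark Streaming-style).
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            break
        handle(batch)

singles = []
record_at_a_time(iter(range(3)), singles.append)
print(singles)   # [[0], [1], [2]]

batches = []
micro_batches(iter(range(10)), batches.append, batch_size=4)
print(batches)   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Batching trades latency for throughput and lets the batch path reuse the engine's batch machinery, which is how Spark Streaming gets "Core Spark API" integration in the table.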
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
31
2. Transition
• Existing Hadoop MapReduce projects can migrate to Spark and leverage Spark Core as the execution engine:
1. You can often reuse your mapper and reducer functions and just call them in Spark from Java or Scala.
2. You can translate your code from MapReduce to Apache Spark. How-to: Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
32
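The translation in point 2 is usually mechanical: the mapper/reducer pair of a MapReduce job becomes a flatMap / map / reduceByKey chain on a Spark RDD. The classic word-count case can be sketched in pure Python; the MiniRDD class below only mimics the shape of the Spark API so the example runs without a cluster, and is not Spark itself.

```python
# Pure-Python sketch of the MapReduce -> Spark word-count translation.
# MiniRDD imitates the names of the Spark RDD API; it is a toy.
class MiniRDD:
    def __init__(self, data):
        self.data = list(data)
    def flatMap(self, f):
        # One input item -> many output items (the "map" phase's fan-out).
        return MiniRDD(x for item in self.data for x in f(item))
    def map(self, f):
        return MiniRDD(f(x) for x in self.data)
    def reduceByKey(self, f):
        # Group values by key and fold them (the "reduce" phase).
        acc = {}
        for k, v in self.data:
            acc[k] = f(acc[k], v) if k in acc else v
        return MiniRDD(acc.items())
    def collect(self):
        return sorted(self.data)

lines = MiniRDD(["spark and hadoop", "spark or hadoop"])
counts = (lines.flatMap(str.split)               # mapper: emit words
               .map(lambda w: (w, 1))            # mapper: emit (word, 1)
               .reduceByKey(lambda a, b: a + b)) # reducer: sum the counts
print(counts.collect())  # [('and', 1), ('hadoop', 2), ('or', 1), ('spark', 2)]
```

In real Spark the same chain runs on a distributed RDD built with `sc.textFile(...)`; the mapper and reducer bodies carry over nearly unchanged, which is the point of the slide.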
2. Transition
3. The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort.
• Speed up your existing Pig scripts on Spark (query, logical plan, physical plan).
• Leverage new Spark-specific operators in Pig, such as Cache.
• Still leverage many existing Pig UDF libraries.
• Pig on Spark umbrella JIRA (status: passed end-to-end test cases on Pig, still open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community.
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort.
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop.
• Performance benefits, especially for Hive queries involving multiple reducer stages.
• Hive on Spark umbrella JIRA (status: open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta, expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: work in progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
Cascading (expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
Mahout (expected in Mahout 1.0)
• Mahout news, 25 April 2014 - "Goodbye MapReduce": Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
Mahout (expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3. Integration
(Diagram: open source tools in the Hadoop ecosystem that integrate with Spark, grouped by service: storage/serving layer, data formats, data ingestion services, resource management, search, and SQL.)
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark (status: still in experimentation, and no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3. Integration
• There is also NSMC (Native Spark MongoDB Connector), for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving, and some issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables.
• Run SQL queries over imported data.
• Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach.
• Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
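The idea behind "no more DDL" above is schema inference: scan the JSON records and derive a field-to-type mapping instead of declaring it up front. A minimal pure-Python sketch of that idea follows; Spark SQL's real inference is far richer (nested structures, type widening, conflict resolution), while this toy handles flat records only.

```python
# Toy sketch of JSON schema inference: derive field -> type by
# scanning records, instead of requiring a DDL declaration up front.
import json

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            # Later records can add fields the earlier ones lacked.
            schema[field] = type(value).__name__
    return schema

records = [
    '{"name": "spark", "year": 2009}',
    '{"name": "hadoop", "year": 2006, "retired_api": false}',
]
print(infer_schema(records))
# {'name': 'str', 'year': 'int', 'retired_api': 'bool'}
```

Note how the second record contributes a field the first one lacked; the inferred schema is the union over all records, which is also how Spark SQL handles ragged JSON datasets.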
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files.
• Run SQL queries over imported data.
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets.
• Data layout can change without notice.
• New data sets can be added without notice.
• Result:
• Leverage Spark to dynamically split the data.
• Leverage Avro to store the data in a compact binary format.
58
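The "dynamically split the data" step from the use case above can be sketched in plain Python: group heterogeneous inbound records by a type tag so that each bucket can then be written out in a compact binary format (Avro, in the slide's scenario). This is an illustrative toy with made-up field names, not code from the cited deck.

```python
# Toy sketch: dynamically split heterogeneous inbound records by a
# type tag, so each bucket can be serialized separately (e.g. to Avro).
from collections import defaultdict

def split_by_type(records, tag_field="type"):
    buckets = defaultdict(list)
    for record in records:
        # Unknown layouts land in a catch-all bucket instead of failing,
        # matching the "new data sets can be added without notice" problem.
        buckets[record.get(tag_field, "unknown")].append(record)
    return dict(buckets)

inbound = [
    {"type": "click", "url": "/a"},
    {"type": "click", "url": "/b"},
    {"user": "x"},                    # new data set, no notice given
]
print(sorted(split_by_type(inbound)))  # ['click', 'unknown']
```

In the real pipeline this grouping would run as a Spark transformation over the inbound RDD, with each resulting bucket written out through an Avro serializer.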
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark in the form of RDD that can
read data from Elasticsearch Also any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html
60
3 Integration
bull Apache Solr added a Spark-based indexing tool for
fast and easy indexing ingestion and serving
searchable complex data ldquoCrunchIndexerTool on
Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution, statistics, or… HDFS caching)
• The Spark execution layer could be leveraged without the
need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with
YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4 Complementarity +
• Data >> RAM: when processing huge data volumes,
much bigger than cluster RAM, Tez might be better,
since it is more "stream oriented," has a more mature
shuffling implementation, and closer YARN integration
• Data << RAM: since Spark can cache parsed data
in memory, it can be much better when we process
data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native
YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
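The rule of thumb above can be sketched as a tiny decision function. This is a purely illustrative heuristic in plain Python (the function name and the simple threshold are my own, not an official sizing formula from either project):

```python
# Illustrative sketch of the slide's rule of thumb: pick the engine by
# comparing data volume to cluster RAM. Hypothetical heuristic only.
def pick_engine(data_gb: float, cluster_ram_gb: float) -> str:
    if data_gb > cluster_ram_gb:
        # Data >> RAM: Tez's stream-oriented, mature shuffling may win
        return "Tez"
    # Data << RAM: Spark's in-memory caching of parsed data pays off
    return "Spark"
```

In practice the decision also depends on workload shape (iterative vs. one-pass) and cluster condition, as the Smart Execution Engine slide below this section argues.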
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer:
a Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process, based on the type of platform, the
attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on
November 13, 2014 with Matt Schumpert, Director of Product
Management at Datameer)
• The Challenge to Choosing the "Right" Execution
Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by
Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles
Big Data Users Group:
http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption, February 12, 2015:
http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms,
February 23, 2015:
http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015:
http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
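Spark's storage-agnosticism boils down to resolving a path's URI scheme to a pluggable backend. A plain-Python sketch of that idea (the scheme-to-backend mapping below is illustrative, not Spark's actual resolution code):

```python
# Conceptual sketch: the URI scheme of a path selects the storage backend,
# which is why the same Spark job can run against HDFS, S3, Tachyon, etc.
def storage_backend(path: str) -> str:
    scheme = path.split("://", 1)[0] if "://" in path else "file"
    backends = {
        "hdfs": "HDFS",
        "s3n": "Amazon S3",
        "tachyon": "Tachyon",
        "maprfs": "MapR-FS",
        "swift": "OpenStack Swift",
        "file": "local file system",
    }
    return backends.get(scheme, "unknown")
```

Swapping storage then means changing only the input path, not the job logic.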
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS," July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
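Across all of the deployments listed above, only the master URL an application is submitted with changes; the application code stays the same. A plain-Python sketch of how a master URL maps to a cluster manager (the mapping mirrors Spark's documented URL prefixes, the function itself is illustrative):

```python
# Conceptual sketch: the master URL alone selects the cluster manager,
# which is what makes Spark agnostic to the clustering infrastructure.
def cluster_manager(master: str) -> str:
    if master.startswith("local"):
        return "local threads"        # e.g. local[4]
    if master.startswith("spark://"):
        return "standalone"           # Spark's own cluster manager
    if master.startswith("mesos://"):
        return "Mesos"
    if master.startswith("yarn"):
        return "YARN"
    raise ValueError("unknown master URL: " + master)
```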
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on
Hadoop. It gets its data from Amazon's S3
(most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and
data products in an instant, March 4, 2015:
https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at
Spark Summit 2014, July 2, 2014:
https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra,
presents itself as a non-Hadoop Big Data platform.
Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with
Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014:
http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector,
Helena Edelson, published November 24, 2014:
http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big
data analytics platform with a novel
architecture that integrates components across
three logical layers: Infrastructure, Analytics,
and Applications
• xPatterns is cloud-based, exceedingly scalable,
and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
• With EPIC software, you can spin up Hadoop
clusters, with the data and analytical tools that
your data scientists need, in minutes rather than
months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes
streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially
compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives

            Hadoop ecosystem   Spark ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory
speed across cluster frameworks such as Spark
and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark
and MapReduce programs can run on top of it
without any code change
• Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-
grained sharing, which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution. This leads to considerable performance
improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing
apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including
Apache Spark, Apache Cassandra, Apache YARN,
Apache HDFS, …
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos

Criteria          YARN                          Mesos
Resource sharing  Yes                           Yes
Written in        Java                          C++
Scheduling        Memory only                   CPU and Memory
Running tasks     Unix processes                Linux Container groups
Requests          Specific requests and         More generic, but more coding
                  locality preference           for writing frameworks
Maturity          Less mature                   Relatively more mature
90
Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, whose much more concise
lambda expressions get code nearly as
simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014:
http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
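Why lambda-friendly APIs matter can be shown with a toy stand-in for the API shape (this `ToyRDD` class is my own illustration in plain Python; it mimics the chained-transformation style of the Spark API but is not Spark itself):

```python
# Toy stand-in showing why a lambda-friendly API keeps pipeline code short:
# each transformation returns a new collection-like object, so a pipeline
# reads as one chained expression, as in the Scala/Java 8/Python Spark APIs.
class ToyRDD:
    def __init__(self, data):
        self._data = list(data)

    def map(self, f):
        return ToyRDD(f(x) for x in self._data)

    def filter(self, p):
        return ToyRDD(x for x in self._data if p(x))

    def collect(self):
        return self._data

# Chained transformations read as one expression:
squares = ToyRDD(range(5)).filter(lambda x: x % 2 == 0).map(lambda x: x * x).collect()
```

Before Java 8, the same pipeline required one anonymous inner class per lambda, which is the verbosity the slide alludes to.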
91
Spark SQL
• Spark SQL is a new SQL engine designed from the
ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains
compatibility with Hive. It supports all existing Hive data
formats, user-defined functions (UDFs), and the Hive
metastore.
• Spark SQL also allows manipulating (semi-)structured
data, as well as ingesting data from sources that
provide a schema, such as JSON, Parquet, Hive, or
EDWs. It unifies SQL and sophisticated analysis,
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' Tag at
SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at
SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming

Criteria                     Storm                      Spark Streaming
Processing model             Record at a time           Mini batches
Latency                      Sub-second                 Few seconds
Fault tolerance (every       At least once (may be      Exactly once
record processed)            duplicates)
Batch framework integration  Not available              Core Spark API
Supported languages          Any programming language   Scala, Java, Python
95
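The processing-model row of the table above can be sketched in plain Python (a conceptual contrast only; neither function is real Storm or Spark Streaming code):

```python
# Storm-style: each record is handled individually on arrival.
def record_at_a_time(stream, handle):
    return [handle(record) for record in stream]

# Spark Streaming-style: records are buffered into mini batches,
# and each batch is processed as a unit; hence the few-seconds latency.
def mini_batches(stream, batch_size, handle_batch):
    results, batch = [], []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            results.append(handle_batch(batch))
            batch = []
    if batch:  # flush the last, possibly partial, batch
        results.append(handle_batch(batch))
    return results
```

Batching is also what lets Spark Streaming reuse the core batch API and get exactly-once semantics per batch.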
GraphX
96
'GraphX' Tag at
SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based
notebook that enables interactive data analytics.
It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based
editor that can combine Scala code, SQL
queries, Markup, or even JavaScript in a
collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for
IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
2 Transition
• Existing Hadoop MapReduce projects can
migrate to Spark and leverage Spark Core as the
execution engine:
1 You can often reuse your mapper and
reducer functions and just call them in
Spark from Java or Scala.
2 You can translate your code from
MapReduce to Apache Spark: How-to:
Translate from MapReduce to Apache Spark: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark
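The translation the two points above describe can be sketched with the classic word count (a conceptual illustration in plain Python, not PySpark: the comments map each step to its Spark counterpart):

```python
from itertools import groupby

# Conceptual sketch of translating a MapReduce word count into the chained
# transformation style used in Spark.
def word_count(lines):
    # mapper phase: flatMap lines into words, map each word to (word, 1)
    pairs = [(word, 1) for line in lines for word in line.split()]
    # shuffle phase: group pairs by key (Spark's reduceByKey does this for you)
    pairs.sort(key=lambda kv: kv[0])
    # reducer phase: sum the counts per word
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=lambda kv: kv[0])}
```

The mapper and reducer bodies survive almost unchanged; what disappears is the job-driver boilerplate around them.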
32
2 Transition
3 The following tools, originally based on Hadoop
MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration
without development effort
• Speed up your existing Pig scripts on Spark (Query,
Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as
Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test
cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality
through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in beta,
expected in Hive 1.1.0)
• A new alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on
MapReduce or Tez easily migrate to Spark without
development effort
• Exposes Spark users to a viable, feature-rich, de facto
standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries
involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in beta,
expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho,
Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and
Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark
(expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially
developed as a tool to transfer data from RDBMS to
Hadoop
• The next version of Sqoop, referred to as Sqoop2,
supports data transfer across any two data sources
• The Sqoop 2 Proposal is still under
discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira
Status: Work In Progress). The goal of this ticket is to support a
pluggable way to select the execution engine on which we can run
the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
(Expected in Cascading 3.1 release)
• Cascading (http://www.cascading.org) is an application
development platform for building data applications on
Hadoop
• Support for Apache Spark is on the roadmap and will be
available in the Cascading 3.1 release.
Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the
transition from Cascading/Scalding to Spark a little
easier by adding support for Cascading Taps, Scalding
Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a
framework for writing, testing, and running
MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a
SparkPipeline class, making it easy to migrate
data processing applications from MapReduce
to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce.
Apache Mahout, the original Machine Learning (ML)
library for Hadoop since 2009, is rejecting new
MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with the new Mahout Scala DSL for Distributed
Machine Learning on Spark. Programs written in this
DSL are automatically optimized and executed in
parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for
the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov,
April 2014:
http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with
Mahout, Scala and Spark, published May 30, 2014:
http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased):
MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
Services of the typical Big Data stack and the open source tools that integrate with Spark (tool logos omitted from this export):
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has
full support for Hadoop InputFormats via
newAPIHadoopRDD. Example: HBaseTest.scala from the
Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available
for reading from and writing to HBase without the need
of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with
Spark. Status: still in experimentation, and no timetable for
possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you
expose Cassandra tables as Spark RDDs, write Spark
RDDs to Cassandra tables, and execute arbitrary CQL
queries in your Spark applications. It also supports
integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration
is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration
using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume
data from Cassandra to Spark and store Resilient
Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new
avenues
• Kindling: An Introduction to Spark with Cassandra
(Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although
it can be used from Spark via the official Mongo-
Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its
support for reading and writing JSON text files
48
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from
Apache Spark (still experimental)
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without
Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph
database
• Getting Started with Apache Spark and Neo4j Using
Docker Compose, by Kenny Bastani, March 10, 2015:
http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015:
http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph
Analytics, by Kenny Bastani, November 3, 2014:
http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit
reference to Mesos as the resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
3 Integration
• Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to
address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query
in-memory data in Spark; embed Drill execution in a
Spark data pipeline
Source: What's Coming in 2015 for
Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed
messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka:
Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming:
Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data
ingestion system designed for the Big Data
ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with
Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based
approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON, which
vastly simplifies the end-to-end experience of
working with JSON data
• Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD. No more DDL; just point Spark
SQL to JSON files and query. Starting with Spark 1.3,
SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
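The automatic inference mentioned above can be sketched in miniature (a plain-Python illustration of the idea, not Spark SQL's actual inference, which also merges conflicting types and handles nesting):

```python
import json

# Minimal sketch of schema inference over newline-delimited JSON records:
# scan the records and note the type observed for each field.
def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema[field] = type(value).__name__
    return schema
```

The point of the slide is that this scan happens for you: no DDL is written before querying.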
56
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
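What "columnar" means can be shown with a tiny row-to-column transposition (a conceptual sketch in plain Python; real Parquet adds encodings, compression, and nested schemas on top of this idea):

```python
# Conceptual sketch of columnar layout: values of one field are stored
# together, which enables per-column compression and lets a query read
# only the columns it needs (column pruning).
def to_columnar(rows):
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns
```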
57
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL (requires Spark 1.2+): https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to
work with datasets on Hadoop, hiding many of
the details of compression codecs, file formats,
partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16
release, so Spark jobs can read and write to Kite
datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics
engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark in the form of an RDD that can
read data from Elasticsearch. Also, any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
bull Apache Solr added a Spark-based indexing tool for
fast and easy indexing ingestion and serving
searchable complex data ldquoCrunchIndexerTool on
Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching)
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4 Complementarity +
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
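The "Data << RAM" point boils down to parse once, reuse many times. A minimal plain-Python sketch of why caching parsed data helps on repeated passes (illustrative only; Spark's `rdd.cache()` applies the same idea across a cluster, and `raw_lines`, `parse`, and the counters here are all hypothetical stand-ins):

```python
# Stand-in for an input file: 3000 CSV-like lines.
raw_lines = ["1,apple", "2,banana", "3,cherry"] * 1000

parse_count = 0  # counts how often we pay the parsing cost

def parse(line):
    global parse_count
    parse_count += 1
    key, value = line.split(",")
    return int(key), value

# Two passes WITHOUT caching: every pass re-parses all lines.
total_1 = sum(k for k, _ in (parse(l) for l in raw_lines))
total_2 = sum(k for k, _ in (parse(l) for l in raw_lines))
uncached_parses = parse_count            # 2 passes x 3000 lines = 6000

# Two passes WITH "caching": parse once, keep parsed records in memory.
parse_count = 0
cached = [parse(l) for l in raw_lines]   # analogous to rdd.cache()
total_3 = sum(k for k, _ in cached)
total_4 = sum(k for k, _ in cached)
cached_parses = parse_count              # 3000, regardless of pass count

print(uncached_parses, cached_parses)    # 6000 3000
```

The gap widens with every additional pass, which is why iterative workloads (machine learning, interactive queries) benefit most from Spark's in-memory caching.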
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer): http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
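What makes this storage flexibility possible is that Spark (through the Hadoop FileSystem API) picks the storage backend from the URI scheme of the input path, so the same job can read from HDFS, S3, or Tachyon unchanged. A plain-Python sketch of that dispatch idea (the `BACKENDS` table and `backend_for` helper are illustrative inventions, not Spark's actual registry):

```python
from urllib.parse import urlparse

# Hypothetical scheme-to-backend table, for illustration only; Spark/Hadoop
# resolve schemes through pluggable Hadoop FileSystem implementations.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "file": "local file system",
    "tachyon": "Tachyon in-memory file system",
}

def backend_for(path):
    """Return which storage backend a Spark-style path would resolve to."""
    scheme = urlparse(path).scheme or "file"   # bare paths default to local
    return BACKENDS.get(scheme, "unknown scheme")

print(backend_for("hdfs://namenode:8020/logs/2015/03/12"))  # Hadoop Distributed File System
print(backend_for("s3n://my-bucket/logs"))                  # Amazon S3
print(backend_for("/tmp/local.txt"))                        # local file system
```

In real code the same switch happens implicitly: `sc.textFile("s3n://bucket/...")` and `sc.textFile("hdfs://...")` differ only in the URI.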
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS," July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• Guavus' operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
             Hadoop Ecosystem   Spark Ecosystem
Components   HDFS               Tachyon
             YARN               Mesos
Tools        Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible. Existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos
Criteria          YARN                        Mesos
Resource sharing  Yes                         Yes
Written in        Java                        C++
Scheduling        Memory only                 CPU and Memory
Running tasks     Unix processes              Linux Container groups
Requests          Specific requests and       More generic, but more coding
                  locality preference         for writing frameworks
Maturity          Less mature                 Relatively more mature
90
Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
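To give a flavor of the native API's conciseness, here is a minimal plain-Python imitation of an RDD-style map/filter/reduce chain (the `LocalRDD` class is a hypothetical local sketch, not part of PySpark; real Spark RDDs are distributed, partitioned, and lazily evaluated):

```python
from functools import reduce

class LocalRDD:
    """Tiny local stand-in for an RDD: wraps a list, chains transformations."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return LocalRDD(f(x) for x in self.data)

    def filter(self, pred):
        return LocalRDD(x for x in self.data if pred(x))

    def reduce(self, f):
        return reduce(f, self.data)

lines = ["spark hadoop", "spark", "hadoop yarn mesos"]
word_count = (LocalRDD(lines)
              .map(lambda line: len(line.split()))   # words per line: 2, 1, 3
              .filter(lambda n: n > 0)               # drop empty lines
              .reduce(lambda a, b: a + b))           # total word count
print(word_count)  # 6
```

The same chain in PySpark would read almost identically, with `sc.parallelize(lines)` (or `sc.textFile(...)`) in place of `LocalRDD(lines)`.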
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
93
Spark Streaming
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
94
Storm vs Spark Streaming
Criteria                      Storm                       Spark Streaming
Processing model              Record at a time            Mini batches
Latency                       Sub-second                  Few seconds
Fault tolerance (every        At least once (may          Exactly once
record processed)             be duplicates)
Batch framework integration   Not available               Core Spark API
Supported languages           Any programming language    Scala, Java, Python
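The "mini batches" row is the key architectural difference: Spark Streaming discretizes a stream into small batches and processes each batch with the normal Spark engine, instead of handling one record at a time as Storm does. A plain-Python sketch of that discretization idea (illustrative only; Spark Streaming batches by time interval, e.g. every second, while this sketch batches by record count to stay deterministic):

```python
def mini_batches(stream, batch_size):
    """Group an incoming stream of records into fixed-size mini-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # hand a whole batch to the batch engine
            batch = []
    if batch:                    # flush the final partial batch
        yield batch

# Per-batch processing, e.g. counting clicks in each mini-batch.
events = ["click", "view", "click", "view", "view", "click", "view"]
batch_counts = [b.count("click") for b in mini_batches(events, 3)]
print(batch_counts)  # [2, 1, 0]
```

Processing per batch rather than per record is what lets Spark Streaming reuse the core Spark API (the "batch framework integration" row) at the cost of a few seconds of latency.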
95
GraphX
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
96
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
97
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
2 Transition
3 The following tools, originally based on Hadoop MapReduce, are being ported to Apache Spark:
• Pig, Hive, Sqoop, Cascading, Crunch, Mahout, …
33
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark Umbrella Jira (Status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open, Q1 2015): https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark (Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
(Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release
Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014: Goodbye MapReduce. Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
Services and the open source tools that integrate with Spark (diagram): Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark (Status: still in experimentation, and no timetable for possible support): http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files
48
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0, SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs; use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
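To make "automatically infer the schema" concrete, here is a rough plain-Python sketch of the idea behind inference over one-JSON-object-per-line data (the `infer_schema` helper is illustrative only; Spark SQL's real inference additionally handles nested structures and type widening, and runs distributed):

```python
import json

def infer_schema(json_lines):
    """Merge a field -> type-name mapping over JSON-per-line records."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            # First type seen wins in this sketch; Spark widens types instead.
            schema.setdefault(field, type(value).__name__)
    return schema

lines = [
    '{"name": "hdfs", "replicas": 3}',
    '{"name": "tachyon", "in_memory": true}',
]
print(infer_schema(lines))
# {'name': 'str', 'replicas': 'int', 'in_memory': 'bool'}
```

Note how records with different fields merge into one schema with optional columns, which is exactly what lets Spark SQL query heterogeneous JSON without any DDL.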
56
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
bull Apache Solr added a Spark-based indexing tool for
fast and easy indexing ingestion and serving
searchable complex data ldquoCrunchIndexerTool on
Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity +
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark exists, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each doing what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
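As a sketch of point 4 above, reading from Amazon S3 needs no HDFS at all; only the URI scheme changes. This is a minimal sketch assuming a Spark 1.x setup and the s3n:// Hadoop filesystem; the bucket and path names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object S3WithoutHdfs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("S3WithoutHdfs"))

    // Credentials for the s3n:// filesystem, read from the environment
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Same textFile() API as with HDFS; only the URI scheme differs
    val logs = sc.textFile("s3n://my-bucket/logs/2015/03/*.log")
    println("lines: " + logs.count())
    sc.stop()
  }
}
```

The same pattern applies to the other storage backends on the slide: swap the URI scheme, keep the application code.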
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
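In code, this agnosticism shows up as nothing more than the master URL. The sketch below, assuming Spark 1.x, uses placeholder host names; the application logic never changes across cluster managers.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The master URL alone selects the cluster manager; host:port values
// below are placeholders for illustration.
val conf = new SparkConf().setAppName("AgnosticApp")
  .setMaster("local[*]")                   // 1. local mode, all cores
//.setMaster("spark://master:7077")        // 2. standalone cluster
//.setMaster("mesos://mesos-master:5050")  // 3. Apache Mesos
// For YARN, the master is usually passed to spark-submit instead:
//   spark-submit --master yarn-cluster ...
val sc = new SparkContext(conf)
```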
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Hadoop Ecosystem        Spark Ecosystem
Component:
HDFS                    Tachyon
YARN                    Mesos
Tools:
Pig                     Spark native API
Hive                    Spark SQL
Mahout                  MLlib
Storm                   Spark Streaming
Giraph                  GraphX
HUE                     Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria           YARN                       Mesos
Resource sharing   Yes                        Yes
Written in         Java                       C++
Scheduling         Memory only                CPU and Memory
Running tasks      Unix processes             Linux Container groups
Requests           Specific requests and      More generic, but more coding
                   locality preference        for writing frameworks
Maturity           Less mature                Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
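For a flavor of how concise the native API is, here is the classic word count as it would look in the interactive Scala shell; `sc` is the SparkContext the shell provides, and the input path is a placeholder.

```scala
// Word count in the Scala API: split lines into words,
// pair each word with 1, then sum the counts per word.
val counts = sc.textFile("hdfs:///data/input.txt")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.take(10).foreach(println)
```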
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
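The "mix and match" point can be sketched in a few lines. This assumes the Spark 1.2-era API (SchemaRDD rather than DataFrame); `sc` is an existing SparkContext, and the data is invented for illustration.

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion (Spark 1.2)

// An ordinary RDD of case classes becomes a queryable table
val people = sc.parallelize(Seq(Person("Ann", 34), Person("Bob", 17)))
people.registerTempTable("people")

// Declarative SQL mixed with the imperative RDD API in one program
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.map(row => "Adult: " + row(0)).collect().foreach(println)
```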
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                   Storm                   Spark Streaming
Processing model           Record at a time        Mini batches
Latency                    Sub-second              Few seconds
Fault tolerance            At least once (may      Exactly once
(every record processed)   be duplicates)
Batch framework            Not available           Core Spark API
integration
Supported languages        Any programming         Scala, Java, Python
                           language
95
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
Pig on Spark (Spork)
• Run Pig with the "-x spark" option for an easy migration without development effort
• Speed up your existing Pig scripts on Spark (Query, Logical Plan, Physical Plan)
• Leverage new Spark-specific operators in Pig, such as Cache
• Still leverage many existing Pig UDF libraries
• Pig on Spark umbrella Jira (status: passed end-to-end test cases on Pig, still Open): https://issues.apache.org/jira/browse/PIG-4059
• Fix outstanding issues and address additional Spark functionality through the community
• 'Pig on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/19
34
Hive on Spark (currently in Beta,
expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark umbrella Jira (status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (currently in Beta,
expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it? Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark
(expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
(Expected in the 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration: Service / Open Source Tool
Storage/Serving Layer
Data Formats
Data Ingestion Services
Resource Management
Search
SQL
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM, Discardable Distributed Memory (http://hortonworks.com/blog/ddm), to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
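The newAPIHadoopRDD route mentioned above looks roughly like the following in the spark-shell, in the spirit of Spark's own HBaseTest.scala; `sc` is an existing SparkContext and the table name is a placeholder.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Point the TableInputFormat at an HBase table (placeholder name)
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

// The full table scan is exposed as an RDD of (row key, row result) pairs
val hbaseRDD = sc.newAPIHadoopRDD(hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])
println("rows: " + hbaseRDD.count())
```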
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
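With the Spark Cassandra Connector, reading and writing tables is a one-liner each. A minimal sketch, assuming `sc` is a SparkContext already configured with `spark.cassandra.connection.host`; the keyspace, table, and column names are invented.

```scala
import com.datastax.spark.connector._ // spark-cassandra-connector

// Expose a Cassandra table as an RDD (placeholder keyspace/table)
val plays = sc.cassandraTable("music", "plays")
println("total plays: " + plays.count())

// Write any RDD of tuples back to a Cassandra table
sc.parallelize(Seq(("song1", 1L), ("song2", 5L)))
  .saveToCassandra("music", "play_counts", SomeColumns("song_id", "plays"))
```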
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files
48
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental)
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
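The Hive integration is exposed through HiveContext. A minimal sketch, assuming a Spark 1.x build with Hive support and a hive-site.xml on the classpath; `sc` is an existing SparkContext and the table name is a placeholder.

```scala
import org.apache.spark.sql.hive.HiveContext

// Requires a Hive metastore (hive-site.xml on the classpath)
val hiveContext = new HiveContext(sc)

hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
// HiveQL queries return RDDs that plug straight into Spark code (or MLlib)
val rows = hiveContext.sql("SELECT key, value FROM src WHERE key < 10")
rows.collect().foreach(println)
```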
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
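The native integration above amounts to a few lines with the receiver-based API of Spark 1.2. In this sketch, `sc` is an existing SparkContext, and the ZooKeeper quorum, consumer group, and topic names are placeholders.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// 2-second micro-batches over a Kafka topic (placeholder addresses/names)
val ssc = new StreamingContext(sc, Seconds(2))
val lines = KafkaUtils.createStream(ssc,
  "zk-host:2181", "my-consumer-group", Map("events" -> 1)).map(_._2)

lines.count().print() // number of events per batch
ssc.start()
ssc.awaitTermination()
```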
3 Integration
• Apache Flume is a streaming event data ingestion system that is designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
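Approach 1 (push-based) can be sketched as follows: Flume's Avro sink is pointed at the host and port where the receiver runs. This assumes Spark 1.x; `sc` is an existing SparkContext and the address is a placeholder.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

// Receiver listens where the Flume Avro sink pushes events (placeholder host:port)
val ssc = new StreamingContext(sc, Seconds(5))
val flumeStream = FlumeUtils.createStream(ssc, "worker-host", 41414)

// Decode each Flume event body as a string and print a sample per batch
flumeStream.map(event => new String(event.event.getBody.array())).print()
ssc.start()
ssc.awaitTermination()
```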
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
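The "no more DDL" workflow looks like this in the Spark 1.2-era API; `sc` is an existing SparkContext and the file path and field names are placeholders.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Schema is inferred automatically from the JSON records
val people = sqlContext.jsonFile("hdfs:///data/people.json")
people.printSchema()

// Query the inferred structure directly with SQL
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").collect()
```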
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
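The round trip through Parquet preserves the schema. A minimal sketch with the Spark 1.2-era API; `sc` is an existing SparkContext and the paths are placeholders.

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Any SchemaRDD source works as input; JSON is used here for brevity
val people = sqlContext.jsonFile("people.json")
people.saveAsParquetFile("people.parquet")   // write columnar Parquet

// Reading back preserves the schema; no DDL required
val parquetPeople = sqlContext.parquetFile("people.parquet")
parquetPeople.registerTempTable("parquet_people")
sqlContext.sql("SELECT name FROM parquet_people").collect()
```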
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN
cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache
Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-
resource-management
bull YARN vs MESOS Canrsquot We All Just Get
Along httpstrataconfcombig-data-conference-ca-
2015publicscheduledetail40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 Use OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
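What makes Spark storage agnostic is that input paths carry a URI scheme, and the Hadoop FileSystem API picks the storage backend from that scheme. The Python sketch below illustrates the idea only; the mapping table is an invented, simplified stand-in for the real registry, not Spark code.

```python
from urllib.parse import urlparse

# Hypothetical, simplified mapping for illustration: Spark (via the
# Hadoop FileSystem API) selects a storage backend from the URI scheme,
# so the same job can read HDFS, S3, Tachyon, or local files.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "tachyon": "Tachyon in-memory file system",
    "file": "local file system",
}

def resolve_backend(path: str) -> str:
    """Return the storage backend implied by a path URI."""
    scheme = urlparse(path).scheme or "file"   # no scheme -> local file
    return BACKENDS.get(scheme, "unknown backend")

print(resolve_backend("hdfs://namenode:8020/data/events"))
print(resolve_backend("s3n://my-bucket/logs/2015/"))
print(resolve_backend("/tmp/local.csv"))
```

The point of the sketch: swapping storage systems is a matter of changing the path prefix, not the job code.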
1 File System
When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite the discussion of HDFS alternatives: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
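Which cluster manager actually runs the job is selected by the master URL handed to spark-submit (for example `local[4]`, `spark://host:7077`, `mesos://host:5050`, or `yarn`). The sketch below is a simplified Python imitation of that dispatch, not Spark's real implementation:

```python
def classify_master(master: str) -> str:
    """Map a Spark --master URL to the cluster manager it selects.

    Simplified sketch of what spark-submit does internally.
    """
    if master.startswith("local"):        # local, local[4], local[*]
        return "local mode"
    if master.startswith("spark://"):     # standalone cluster manager
        return "standalone"
    if master.startswith("mesos://"):     # Apache Mesos
        return "mesos"
    if master.startswith("yarn"):         # yarn-client / yarn-cluster
        return "yarn"
    raise ValueError("unknown master URL: " + master)

print(classify_master("local[*]"))
print(classify_master("mesos://zk://host:2181/mesos"))
```

The takeaway mirrors the slide: the application code is unchanged; only the master URL decides whether Hadoop (YARN) is involved at all.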
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution:
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming, as a continuous computing framework, and the Siddhi CEP engine, as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
           Hadoop Ecosystem   Spark Ecosystem
Component  HDFS               Tachyon
           YARN               Mesos
Tools      Pig                Spark native API
           Hive               Spark SQL
           Mahout             MLlib
           Storm              Spark Streaming
           Giraph             GraphX
           HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
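Mesos' fine-grained sharing works through resource offers: the master offers idle CPU and memory to frameworks, and each framework accepts only what it currently needs, so idle capacity flows to whoever can use it. The toy Python simulation below illustrates that loop; all names and numbers are invented, and this is not the Mesos API.

```python
# Toy simulation of Mesos-style resource offers (illustrative only).
# The master offers idle resources; each framework accepts what it
# needs and declines the rest, which stays available for others.

def run_offer_cycle(idle_cpus, frameworks):
    """Offer idle CPUs to each framework in turn; return allocations."""
    allocations = {}
    for name, wanted in frameworks:
        granted = min(wanted, idle_cpus)   # accept up to the offer
        allocations[name] = granted
        idle_cpus -= granted               # declined share stays idle
    return allocations, idle_cpus

# A long-running Spark job wants 6 CPUs, a web service wants 2.
allocs, left = run_offer_cycle(8, [("spark-job", 6), ("web-app", 2)])
print(allocs, "idle:", left)
```

This is why a long-running Spark job on Mesos can soak up idle cluster capacity mid-execution and release it again, rather than holding a fixed static slice.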
YARN vs. Mesos
Criteria          YARN                     Mesos
Resource sharing  Yes                      Yes
Written in        Java                     C++
Scheduling        Memory only              CPU and Memory
Running tasks     Unix processes           Linux Container groups
Requests          Specific requests and    More generic, but more coding
                  locality preference      for writing frameworks
Maturity          Less mature              Relatively more mature
90
Spark Native API
• Spark Native API in Scala, Java and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
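The flavor of the native API is functional: a dataset flows through map/filter/reduce-style operators. The snippet below imitates the classic RDD word count in plain Python (stdlib only, no Spark; `lines` is invented sample data), so you can see the shape of the pipeline without a cluster:

```python
from collections import Counter
from itertools import chain

# Plain-Python imitation of the classic Spark word count pipeline:
#   sc.textFile(...).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
lines = ["spark and hadoop", "spark without hadoop"]

words = chain.from_iterable(line.split() for line in lines)  # flatMap
counts = Counter(words)                                      # map + reduceByKey

print(counts["spark"])
```

In real Spark the same pipeline is one chained expression over an RDD; the Java 8 lambda support mentioned above is what makes the Java version read almost like this.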
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
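The "mix and match" point can be sketched with stdlib sqlite3 standing in for Spark SQL (the table and values below are invented for illustration): the relational step is declarative SQL, and the follow-up step is ordinary imperative code over the result.

```python
import sqlite3

# sqlite3 stands in for Spark SQL here: declarative SQL handles the
# relational aggregation, imperative Python handles what comes after.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ana", 3), ("bob", 7), ("ana", 5)])

rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()                                  # the SQL part

top = [u for u, total in rows if total > 4]   # the imperative part
print(top)
```

In Spark SQL the same interleaving happens in one program: a SQL query produces a SchemaRDD, which the surrounding Scala/Java/Python code keeps transforming.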
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                     Storm                   Spark Streaming
Processing model             Record at a time        Mini batches
Latency                      Sub-second              Few seconds
Fault tolerance (every       At least once (may      Exactly once
record processed)            be duplicates)
Batch framework integration  Not available           Core Spark API
Supported languages          Any programming         Scala, Java, Python
                             language
95
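The processing-model row is the key difference: Storm handles each record as it arrives, while Spark Streaming chops the stream into mini batches and runs a small batch job on each, trading a few seconds of latency for batch-engine throughput and exactly-once semantics. A stdlib sketch of the mini-batch model (batch size and data are invented):

```python
# Toy mini-batch model (what Spark Streaming calls a DStream):
# the stream is sliced into fixed-size batches and each batch is
# processed as a small batch job.

def micro_batches(stream, batch_size):
    """Yield the stream as consecutive mini batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                 # flush the trailing partial batch
        yield batch

stream = [1, 2, 3, 4, 5, 6, 7]
sums = [sum(b) for b in micro_batches(stream, 3)]  # per-batch job
print(sums)
```

A record-at-a-time engine like Storm would instead invoke the processing function once per element, which is what buys it sub-second latency.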
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file system agnostic. Bring your own storage.
2 Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• New alternative to using MapReduce or Tez:
hive> set hive.execution.engine=spark;
• Helps existing Hive applications running on MapReduce or Tez easily migrate to Spark, without development effort
• Exposes Spark users to a viable, feature-rich, de facto standard SQL tool on Hadoop
• Performance benefits, especially for Hive queries involving multiple reducer stages
• Hive on Spark Umbrella Jira (Status: Open), Q1 2015: https://issues.apache.org/jira/browse/HIVE-7292
35
Hive on Spark (Currently in Beta, Expected in Hive 1.1.0)
• Design: http://blog.cloudera.com/blog/2014/07/apache-hive-on-apache-spark-motivations-and-design-principles
• Getting Started: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark+Getting+Started
• Hive on Spark, February 11, 2015, Szehon Ho, Cloudera: http://www.slideshare.net/trihug/trihug-feb-hive-on-spark
• Hive on Spark is blazing fast, or is it?, Carter Shanklin and Mostapah Mokhtar (Hortonworks), February 20, 2015: http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final
• 'Hive on Spark' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/12
36
Sqoop on Spark
(Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 Proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (Jira Status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which we can run the Sqoop jobs: https://issues.apache.org/jira/browse/SQOOP-1532
37
(Expected in Cascading 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014 – Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
(Overview table: open source tools that integrate with Spark, by service category – Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL)
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM: Discardable Distributed Memory (http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark, and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC: Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0: SPARK-2883 https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs; use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style Push-based Approach
• Approach 2 (Experimental): Pull-based Approach using a Custom Sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON, vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
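Schema inference over JSON boils down to sampling records and unioning the fields they contain. The stdlib sketch below shows that idea in miniature (the records are invented, and Spark SQL's real inference additionally merges conflicting and nested types):

```python
import json

# Simplified schema inference over JSON records: collect each field's
# type across a sample, roughly what Spark SQL does before exposing
# the data as a SchemaRDD / DataFrame.
sample = [
    '{"name": "ana", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]

schema = {}
for line in sample:
    for field, value in json.loads(line).items():
        schema.setdefault(field, type(value).__name__)

print(sorted(schema.items()))
```

Note how the inferred schema is the union of both records' fields, which is why rows missing a field simply get a null for it at query time.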
3 Integration
• Apache Parquet is a columnar storage format, available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
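Parquet's benefit comes from its columnar layout: values of one column are stored together, so a query touching two columns out of many reads only those. A toy row-to-column pivot in Python makes the layout concrete (the data is invented; real Parquet adds encoding, compression, and row groups on top):

```python
# Toy illustration of columnar storage: pivot row-oriented records
# into per-column arrays, the layout Parquet uses so a scan can read
# only the columns a query actually needs.
rows = [
    {"user": "ana", "clicks": 3, "country": "US"},
    {"user": "bob", "clicks": 7, "country": "FR"},
]

columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query over just 'clicks' now touches one array, not every row.
print(columns["clicks"], sum(columns["clicks"]))
```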
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN
cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache
Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-
resource-management
bull YARN vs MESOS Canrsquot We All Just Get
Along httpstrataconfcombig-data-conference-ca-
2015publicscheduledetail40620
67
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the
need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with
YARN (resource chaining in clusters)
bull Tez supports enterprise security
68
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer:
a Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process, based on the type of platform, the
attributes of the data, and the condition of the cluster
• Matt Schumpert on Datameer's Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13, 2014 with Matt Schumpert, Director of Product
Management at Datameer
• The Challenge to Choosing the "Right" Execution
Engine, by Peter Voss, September 30, 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by
Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles
Big Data Users Group
• httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
• New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption, February 12, 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms,
February 23, 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
• Framework for the Future of Hadoop, March 9, 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing.
Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3. Use an in-memory distributed file system such as Spark's cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3
• httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
• MapR-FS
• httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
• httpssparkapacheorgdocslateststorage-openstack-swifthtml
• httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support) httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
• Quantcast QFS httpswwwquantcastcomengineeringqfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local httpsparkbigdatacomtutorials51-deployment121-local
2. Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3. Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4. Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5. Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6. Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7. Google Cloud Platform httpsparkbigdatacomtutorials51-deployment139-google-cloud
8. HPC Clusters
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
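The deployment modes above correspond to different `--master` URLs passed to `spark-submit` (or set on `SparkConf`). A small plain-Python helper makes the mapping explicit; the URL formats for local, standalone, Mesos, and YARN are the documented Spark 1.x forms, while the host names and the app file name are placeholders.

```python
# Master URL formats for the main Spark deployment modes
# (host names and ports here are placeholders).
MASTER_URLS = {
    "local":      "local[*]",                  # all cores on one machine
    "standalone": "spark://master-host:7077",  # Spark standalone cluster
    "mesos":      "mesos://mesos-host:5050",   # Apache Mesos cluster
    "yarn":       "yarn-client",               # YARN (client mode, Spark 1.x)
}

def spark_submit_command(mode, app="my_app.py"):
    """Build an illustrative spark-submit invocation for a deployment mode."""
    return "spark-submit --master {} {}".format(MASTER_URLS[mode], app)

print(spark_submit_command("mesos"))
# spark-submit --master mesos://mesos-host:5050 my_app.py
```

Amazon EC2/EMR, Rackspace, Google Cloud, and the HPC setups listed above ultimately resolve to one of these cluster-manager modes under the hood.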
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on
Hadoop. It gets its data from Amazon's S3
(most commonly), Redshift, or Elastic MapReduce httpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready httpwwwstratiocom
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine httpstratiogithubiostreaming-cep-engine
• 'Stratio' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives
           Hadoop Ecosystem   Spark Ecosystem
Component  HDFS               Tachyon
           YARN               Mesos
Tools      Pig                Spark native API
           Hive               Spark SQL
           Mahout             MLlib
           Storm              Spark Streaming
           Giraph             GraphX
           HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory
speed across cluster frameworks such as Spark
and MapReduce httptachyon-projectorg
• Tachyon is Hadoop compatible: existing Spark
and MapReduce programs can run on top of it
without any code change
• Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
• Mesos (httpmesosapacheorg) enables fine-grained
sharing, which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution. This leads to considerable performance
improvements, especially for long-running Spark jobs
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing
apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services, including
Apache Spark, Apache Cassandra, Apache YARN,
Apache HDFS…
• 'Mesos' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs. Mesos
Criteria          YARN                    Mesos
Resource sharing  Yes                     Yes
Written in        Java                    C++
Scheduling        Memory only             CPU and Memory
Running tasks     Unix processes          Linux Container groups
Requests          Specific requests and   More generic, but more coding
                  locality preference     for writing frameworks
Maturity          Less mature             Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more
concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag11-core-spark
91
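The "concise lambda" point is easiest to see with the canonical word count. The following is plain Python imitating the RDD flatMap/map/reduceByKey chain, not PySpark itself; in PySpark the same chain of operations runs distributed over an RDD.

```python
from collections import defaultdict

# Plain-Python imitation of the classic Spark word count:
# lines.flatMap(split).map(word -> (word, 1)).reduceByKey(add)
lines = ["to be or not", "to be"]

# flatMap: one line -> many words
words = [w for line in lines for w in line.split()]

# map: word -> (word, 1)
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

assert counts["to"] == 2 and counts["be"] == 2 and counts["or"] == 1
```

Java 8 lambdas let the same pipeline be written in Java with nearly this level of brevity, where Java 7 required verbose anonymous inner classes for each step.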
Spark SQL
• Spark SQL is a new SQL engine designed from the
ground up for Spark httpssparkapacheorgsql
• Spark SQL provides SQL performance and maintains
compatibility with Hive. It supports all existing Hive data
formats, user-defined functions (UDFs), and the Hive
metastore
• Spark SQL also allows manipulating (semi-)structured
data, as well as ingesting data from sources that
provide schema, such as JSON, Parquet, Hive, or
EDWs. It unifies SQL and sophisticated analysis,
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
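To illustrate the "mix SQL and imperative code" idea without a Spark cluster, here is an analogous round trip using Python's stdlib `sqlite3` (an analogy only; Spark SQL does the equivalent over distributed SchemaRDDs/DataFrames): aggregate declaratively with SQL, then post-process the rows imperatively in the host language. The table and data are made up for the example.

```python
import sqlite3

# Analogy for Spark SQL's "mix SQL and imperative code" workflow,
# using stdlib sqlite3 instead of a Spark cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 10.0), ("bob", 4.0), ("ann", 6.0)])

# Declarative part: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative part: arbitrary post-processing in the host language.
big_spenders = [user for user, total in rows if total > 5.0]

assert rows == [("ann", 16.0), ("bob", 4.0)]
assert big_spenders == ["ann"]
```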
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs. Spark Streaming
Criteria                      Storm                     Spark Streaming
Processing model              Record at a time          Mini batches
Latency                       Sub-second                Few seconds
Fault tolerance (every        At least once (may be     Exactly once
record processed)             duplicates)
Batch framework integration   Not available             Core Spark API
Supported languages           Any programming language  Scala, Java, Python
95
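The "record at a time" vs. "mini batches" row can be sketched in plain Python (not Spark Streaming code): micro-batching, Spark Streaming's model, groups records by arrival time into small windows, trading a few seconds of latency for batch-style processing, while a record-at-a-time system like Storm handles each record as it arrives.

```python
# Plain-Python sketch of micro-batching (Spark Streaming's model):
# records tagged with an arrival time are grouped into 2-second batches.
# A record-at-a-time system (Storm's model) would instead process each
# record individually on arrival.
BATCH_SECONDS = 2

events = [(0.1, "a"), (0.9, "b"), (2.3, "c"), (3.8, "d"), (4.1, "e")]

batches = {}
for t, record in events:
    batch_id = int(t // BATCH_SECONDS)  # which 2-second window
    batches.setdefault(batch_id, []).append(record)

assert batches == {0: ["a", "b"], 1: ["c", "d"], 2: ["e"]}
```

Each completed batch is then processed with the ordinary batch API, which is why Spark Streaming integrates with core Spark while Storm has no batch framework integration.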
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
• Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics.
It has built-in Apache Spark support
• Spark Notebook is an interactive web-based
editor that can combine Scala code, SQL
queries, Markup, or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-notebook
• ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
Hive on Spark (currently in beta;
expected in Hive 1.1.0)
bull Design httpblogclouderacomblog201407apache-hive-on-apache-spark-
motivations-and-design-principles
bull Getting StartedhttpscwikiapacheorgconfluencedisplayHiveHive+on+Spark+Getting+Start
ed
bull Hive on Spark February 11 2015 Szehon Ho
Clouderahttpwwwslidesharenettrihugtrihug-feb-hive-on-spark
• Hive on Spark is blazing fast... or is it? Carter Shanklin and
Mostafa Mokhtar (Hortonworks), February 20, 2015 httpwwwslidesharenethortonworkshive-on-spark-is-blazing-fast-or-is-it-final
bull lsquoHive on Sparkrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag12
36
Sqoop on Spark
(Expected in Sqoop 2)
• Sqoop (aka "SQL to Hadoop") was initially
developed as a tool to transfer data from RDBMS to
Hadoop
• The next version of Sqoop, referred to as Sqoop2,
supports data transfer across any two data sources
bull Sqoop 2 Proposal is still under
discussionhttpscwikiapacheorgconfluencedisplaySQOOPSqoop2+Pro
posal
bull Sqoop2 Support Sqoop on Spark Execution Engine (Jira
Status Work In Progress) The goal of this ticket is to support a
pluggable way to select the execution engine on which we can run
the Sqoop jobs httpsissuesapacheorgjirabrowseSQOOP-1532
37
(Expected in Cascading 3.1 release)
bull Cascading httpwwwcascadingorg is an application
development platform for building data applications on
Hadoop
• Support for Apache Spark is on the roadmap and will be
available in the Cascading 3.1 release
Source httpwwwcascadingorgnew-fabric-support
bull Spark-scalding is a library that aims to make the
transition from CascadingScalding to Spark a little
easier by adding support for Cascading Taps Scalding
Sources and the Scalding Fields API in Spark Sourcehttpscaldingio201410running-scalding-on-apache-spark
38
Apache Crunch
bull The Apache Crunch Java library provides a
framework for writing testing and running
MapReduce pipelines httpscrunchapacheorg
bull Apache Crunch 011 releases with a
SparkPipeline class making it easy to migrate
data processing applications from MapReduce
to Spark httpscrunchapacheorgapidocs0110orgapachecrunchimplsparkSpark
Pipelinehtml
bull Running Crunch with Spark httpwwwclouderacomcontentclouderaendocumentationcorev5-2-
xtopicscdh_ig_running_crunch_with_sparkhtml
39
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce:
Apache Mahout, the original Machine Learning (ML)
library for Hadoop since 2009, is rejecting new
MapReduce algorithm implementations httpmahoutapacheorg
bull Integration of Mahout and Spark
bull Reboot with new Mahout Scala DSL for Distributed
Machine Learning on Spark Programs written in this
DSL are automatically optimized and executed in
parallel on Apache Spark
bull Mahout Interactive Shell Interactive REPL shell for
Spark optimized Mahout DSLhttpmahoutapacheorgusersrecommenderintro-cooccurrence-sparkhtml
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell httpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov
April 2014
httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with
Mahout Scala and Spark Published on May 30 2014
httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-
with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased):
MapReduce, Spark, H2O, Flink httpmahoutapacheorgusersbasicsalgorithmshtml
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3. Integration: Service / Open Source Tool
(Slide diagram: storage/serving layer, data formats, data ingestion services, resource management, search, and SQL — each service paired with the open source tools that integrate with Spark)
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory cache httpsissuesapacheorgjirabrowseSPARK-1767
• Use DDM (Discardable Distributed Memory httphortonworkscomblogddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0 httpsissuesapacheorgjirabrowseHDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has
full support for Hadoop InputFormats via
newAPIHadoopRDD. Example: HBaseTest.scala from the
Spark code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapach
esparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available
for reading from and writing to HBase without the need
of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with
Spark Status Still in experimentation and no timetable for
possible support httpblogclouderacomblog201412new-in-cloudera-
labs-sparkonhbase
45
3 Integration
bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark
RDDs to Cassandra tables and execute arbitrary CQL
queries in your Spark applications Supports also
integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
• Spark + Cassandra using Deep: the integration
is not based on Cassandra's Hadoop interface httpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag20-cassandra
46
3 Integration
bull Benchmark of Spark amp Cassandra Integration
using different approacheshttpwwwstratiocomdeep-vs-datastax
• Calliope is a library providing an interface to consume
data from Cassandra into Spark and store Resilient
Distributed Datasets (RDDs) from Spark to Cassandra httptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new
avenues
bull Kindling An Introduction to Spark with Cassandra
(Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-
spark-with-cassandra
47
3 Integration
bull MongoDB is not directly served by Spark although
it can be used from Spark via an official Mongo-
Hadoop connector
bull MongoDB-Spark Demohttpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-
insights
bull Spark SQL also provides indirect support via its
support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
48
3 Integration
bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from
Apache Spark (still experimental)
bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-
introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-
example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-
example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without
Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
49
3 Integration
bull Neo4j is a highly scalable robust (fully ACID) native graph
database
bull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015
httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015
httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph
Analytics By Kenny Bastani November 3 2014
httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
50
3 Integration YARN
• YARN: Yet Another Resource Negotiator (an implicit
reference to Mesos as the resource negotiator)
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND
20summary20~20yarn20AND20status203D20OPEN20ORDER20
BY20priority20DESC0A
bull Some issues are critical ones
bull Running Spark on YARNhttpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted in Spark 1.3.0 (SPARK-2883) httpsissuesapacheorgjirabrowseSPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
3 Integration
bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to
address new use cases
bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query
in-memory data in Spark Embed Drill execution in a
Spark data pipeline
Source: What's Coming in 2015 for
Drill httpdrillapacheorgblog20141216whats-coming-in-2015
53
3 Integration
bull Apache Kafka is a high throughput distributed
messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka
Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming
Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-
example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka
54
3 Integration
bull Apache Flume is a streaming event data
ingestion system that is designed for Big Data
ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with
Flume There are two approaches to this
bull Approach 1 Flume-style Push-based Approach
bull Approach 2 (Experimental) Pull-based
Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
55
3 Integration
• Spark SQL provides built-in support for JSON that
vastly simplifies the end-to-end experience of
working with JSON data
• Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD. No more DDL; just point Spark
SQL to JSON files and query. Starting with Spark 1.3,
SchemaRDD will be renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015 httpdatabrickscomblog20150202an-introduction-to-json-
support-in-spark-sqlhtml
56
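What "automatically infer the schema" means can be sketched in a few lines of plain Python: scan the JSON records and take the union of field names with the type observed for each. Spark SQL's real inference is far more complete (nested structs, arrays, type widening); this toy and its records are illustrative only.

```python
import json

# Toy schema inference over JSON records: union of all fields,
# with the Python type name observed for each field.
raw = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]

schema = {}
for line in raw:
    for field, value in json.loads(line).items():
        schema.setdefault(field, type(value).__name__)

assert schema == {"name": "str", "age": "int", "city": "str"}
```

Note that records need not share the same fields; the inferred schema is the union, which is exactly why no DDL is required up front.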
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language httpparquetincubatorapacheorg
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files httpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
• An illustrative example of the integration of Parquet and Spark SQL httpwwwinfoobjectscomspark-sql-parquet
57
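The columnar idea behind Parquet can be shown in plain Python (a sketch of the layout only, not the Parquet format itself): rows are pivoted into per-column arrays, so a query touching one column reads only that column's data instead of deserializing every full row.

```python
# Row-oriented records pivoted into a columnar layout -
# the core idea behind Parquet's storage format.
rows = [
    {"user": "ann", "amount": 10, "country": "US"},
    {"user": "bob", "amount": 4,  "country": "FR"},
    {"user": "cal", "amount": 6,  "country": "US"},
]

# Pivot: one contiguous list per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# A scan over a single column now touches only that column's values.
total = sum(columns["amount"])

assert columns["country"] == ["US", "FR", "US"]
assert total == 20
```

Homogeneous per-column values are also what makes Parquet's compression and encoding so effective.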
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+ httpsgithubcomdatabricksspark-avro
• An example of using Avro and Parquet in Spark SQL httpwwwinfoobjectscomspark-with-avro
• Avro/Spark use case httpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
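The "dynamically split the data" step can be sketched in plain Python (the records and field names are made up for illustration): route heterogeneous inbound records by their observed field set, so each output bucket has one uniform layout, which Avro can then store compactly against a per-bucket schema.

```python
# Route heterogeneous inbound records by their field set, so each
# bucket holds records with one uniform layout - a sketch of the
# "dynamically split the data" step (Avro would then store each
# bucket against its own schema).
inbound = [
    {"id": 1, "clicks": 5},
    {"id": 2, "query": "spark"},
    {"id": 3, "clicks": 7},
]

buckets = {}
for record in inbound:
    layout = tuple(sorted(record))  # the field names identify the layout
    buckets.setdefault(layout, []).append(record)

assert ("clicks", "id") in buckets and ("id", "query") in buckets
```

New layouts arriving without notice simply create new buckets, which matches the "data sets can be added without notice" constraint above.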
3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to
work with datasets on Hadoop, hiding many of
the details of compression codecs, file formats,
partitioning strategies, etc. httpkitesdkorgdocscurrent
• Spark support has been added in the Kite 0.16
release, so Spark jobs can read and write to Kite
datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
59
3 Integration
bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
• Apache Spark support in Elasticsearch (elasticsearch-hadoop) was added in version 2.1 httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark in the form of RDD that can
read data from Elasticsearch Also any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for
fast and easy indexing, ingestion, and serving of
searchable complex data: "CrunchIndexerTool on
Spark"
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com httpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
• Apache Mesos vs. Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN
cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache
Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-
resource-management
bull YARN vs MESOS Canrsquot We All Just Get
Along httpstrataconfcombig-data-conference-ca-
2015publicscheduledetail40620
67
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the
need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with
YARN (resource chaining in clusters)
bull Tez supports enterprise security
68
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System; your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
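The storage options above differ mainly in the URI scheme the path string carries; Spark delegates to the matching storage backend. A plain-Python sketch of that dispatch idea (the helper and scheme table are illustrative, not Spark code):

```python
from urllib.parse import urlparse

# Illustrative mapping: Spark picks a storage backend from the URI scheme
# of the input path, which is why swapping HDFS for S3, Tachyon, or local
# disk is mostly a matter of changing the path string.
KNOWN_SCHEMES = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3 (native)",
    "tachyon": "Tachyon in-memory FS",
    "file": "local file system",
}

def storage_backend(path):
    # An empty scheme (a bare path) means the local file system.
    scheme = urlparse(path).scheme or "file"
    return KNOWN_SCHEMES.get(scheme, "unknown backend")

print(storage_backend("hdfs://namenode:8020/logs/2015"))  # Hadoop Distributed File System
print(storage_backend("s3n://my-bucket/logs/2015"))       # Amazon S3 (native)
print(storage_backend("/tmp/logs"))                       # local file system
```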
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
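The deployment modes above are selected purely by the "master" URL handed to the Spark context; the URL formats below follow Spark's documented conventions of that era (local[*], spark://host:7077, mesos://host:5050, yarn-client/yarn-cluster). A small illustrative classifier (hypothetical helper, not part of Spark):

```python
# Sketch: classify a Spark deployment mode from its master URL.
def deployment_mode(master):
    if master.startswith("local"):        # local, local[4], local[*]
        return "local"
    if master.startswith("spark://"):     # standalone master, e.g. spark://host:7077
        return "standalone"
    if master.startswith("mesos://"):     # e.g. mesos://host:5050
        return "mesos"
    if master in ("yarn-client", "yarn-cluster"):
        return "yarn"
    raise ValueError("unrecognized master URL: %s" % master)

print(deployment_mode("local[*]"))           # local
print(deployment_mode("spark://host:7077"))  # standalone
```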
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives
             Hadoop ecosystem    Spark ecosystem
Components:
             HDFS                Tachyon
             YARN                Mesos
Tools:
             Pig                 Spark native API
             Hive                Spark SQL
             Mahout              MLlib
             Storm               Spark Streaming
             Giraph              GraphX
             HUE                 Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, YARN, HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria          YARN                        Mesos
Resource sharing  Yes                         Yes
Written in        Java                        C++
Scheduling        Memory only                 CPU and memory
Running tasks     Unix processes              Linux container groups
Requests          Specific requests and       More generic, but more coding
                  locality preference         for writing frameworks
Maturity          Less mature                 Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
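To show the shape of the chained functional API the slide refers to, here is the classic word count collapsed onto plain Python builtins. This is illustrative only; the real Spark version chains flatMap/map/reduceByKey on an RDD across a cluster:

```python
from functools import reduce
from collections import Counter

lines = ["to be or not to be", "to spark or to hadoop"]

# flatMap: split every line into words (one flat stream of words).
words = (w for line in lines for w in line.split())

# map + reduceByKey, folded into one reduce over Counters:
# each word contributes a count of 1, and counts merge by key.
counts = reduce(lambda acc, w: acc + Counter({w: 1}), words, Counter())

print(counts["to"])  # 4
```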
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                                  Storm                              Spark Streaming
Processing model                          Record at a time                   Mini batches
Latency                                   Sub-second                         Few seconds
Fault tolerance (every record processed)  At least once (may be duplicates)  Exactly once
Batch framework integration               Not available                      Core Spark API
Supported languages                       Any programming language           Scala, Java, Python
95
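The "mini batches" row above can be made concrete: Spark Streaming groups records arriving within a time interval into a small batch and hands each batch to the Spark engine, while Storm processes each record as it arrives. A toy simulation in plain Python, approximating the time interval by a fixed record count:

```python
# Sketch of the mini-batch model: incoming records are buffered and
# emitted in small groups, instead of being processed one at a time.
def micro_batches(records, batch_size):
    batch = []
    for r in records:
        batch.append(r)
        if len(batch) == batch_size:
            yield batch       # a full "batch interval" worth of records
            batch = []
    if batch:
        yield batch           # final partial batch when the stream ends

stream = range(7)
print(list(micro_batches(stream, 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```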
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
Sqoop on Spark
(Expected in Sqoop 2)
• Sqoop (aka "from SQL to Hadoop") was initially developed as a tool to transfer data from RDBMS to Hadoop.
• The next version of Sqoop, referred to as Sqoop2, supports data transfer across any two data sources.
• The Sqoop 2 proposal is still under discussion: https://cwiki.apache.org/confluence/display/SQOOP/Sqoop2+Proposal
• Sqoop2: Support Sqoop on Spark Execution Engine (JIRA status: Work In Progress). The goal of this ticket is to support a pluggable way to select the execution engine on which to run Sqoop jobs. https://issues.apache.org/jira/browse/SQOOP-1532
37
(Expected in Cascading 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier, by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout news, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original machine learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for distributed machine learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014 http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published May 30, 2014 http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3. Integration
(Slide graphic: open source tools from the Hadoop ecosystem that integrate with Spark, grouped by service: storage/serving layer, data formats, data ingestion services, resource management, search, and SQL.)
43
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
44
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need for the Hadoop API anymore: Spark-HBase Connector https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1) http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
48
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: https://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883). https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
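The schema inference described above can be pictured as a scan that merges the fields and value types observed across records. A toy stdlib-only version of the idea (not Spark code; Spark SQL does this distributed and with a much richer type system):

```python
import json

# Toy schema inference: scan JSON records and record, per field,
# the set of value types observed across the whole dataset.
def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    # Join multiple observed types for a field with "/".
    return {f: "/".join(sorted(t)) for f, t in schema.items()}

records = ['{"name": "kafka", "port": 9092}',
           '{"name": "hdfs", "secure": true}']
print(infer_schema(records))
# {'name': 'str', 'port': 'int', 'secure': 'bool'}
```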
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
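The benefit of a columnar format like Parquet is easy to see in miniature: an aggregate over one field touches a single contiguous column rather than every full row. A plain-Python illustration of the layout change (not Parquet's actual encoding, which adds per-column compression and encodings):

```python
# Row-oriented records, as a log file or JSON dump would store them.
rows = [
    {"ip": "10.0.0.1", "bytes": 512},
    {"ip": "10.0.0.2", "bytes": 2048},
    {"ip": "10.0.0.1", "bytes": 128},
]

# Transpose to column-oriented: one contiguous list per field.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# An analytic query now scans only the column it needs.
print(sum(columns["bytes"]))  # 2688
```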
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4. Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014 http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015 http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity: YARN + Mesos
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez can take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or HDFS caching).
• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine Interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a Non-Hadoop Big Data Platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
           Hadoop Ecosystem   Spark Ecosystem
Component  HDFS               Tachyon
           YARN               Mesos
Tools      Pig                Spark native API
           Hive               Spark SQL
           Mahout             MLlib
           Storm              Spark Streaming
           Giraph             GraphX
           HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos
Criteria          YARN                          Mesos
Resource sharing  Yes                           Yes
Written in        Java                          C++
Scheduling        Memory only                   CPU and Memory
Running tasks     Unix processes                Linux Container groups
Requests          Specific requests and         More generic, but more coding
                  locality preference           for writing frameworks
Maturity          Less mature                   Relatively more mature
90
Spark Native API
• Spark Native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
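The "mix and match SQL and imperative APIs" idea above can be illustrated without a Spark cluster. The sketch below uses Python's stdlib sqlite3 as a stand-in SQL engine (not Spark SQL's API); the names `events`, `top`, and the data are hypothetical, but the two-step pattern — a declarative SQL aggregation followed by imperative post-processing in the host language — is the same one Spark SQL enables over RDDs/DataFrames:

```python
import sqlite3

# Illustrative stand-in: sqlite3 plays the role of the SQL engine so the
# example runs anywhere; with Spark SQL the same pattern applies to tables
# registered from RDDs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 5)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) AS total FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: continue in the host language on the query result.
top = {user: total for user, total in rows if total > 4}
print(top)  # {'ann': 8, 'bob': 7}
```

The boundary between the two steps is the point Spark SQL removes: the query result stays a first-class distributed collection instead of leaving the engine.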
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming
Criteria                    Storm                      Spark Streaming
Processing model            Record at a time           Mini-batches
Latency                     Sub-second                 Few seconds
Fault tolerance (every      At least once (may be      Exactly once
record processed)           duplicates)
Batch framework             Not available              Core Spark API
integration
Supported languages         Any programming language   Scala, Java, Python
95
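The "record at a time" vs "mini-batches" row in the table above can be sketched in plain Python. This is a conceptual toy, not Spark's or Storm's API: the real Spark Streaming discretizes a stream by time interval, while this sketch batches by count to stay deterministic; `micro_batches` and the sample data are invented for illustration:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group a stream of records into fixed-size mini-batches, mimicking
    how Spark Streaming discretizes a stream (by time in the real system)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

records = [1, 2, 3, 4, 5, 6, 7]

# Record-at-a-time (Storm-style): one operation per record.
per_record = [r * 2 for r in records]

# Mini-batch (Spark Streaming-style): one operation per small batch,
# which is what lets the same batch API (and its fault tolerance) apply.
per_batch = [sum(b) for b in micro_batches(records, 3)]
print(per_batch)  # [6, 15, 7]
```

The trade-off in the table follows directly: batching adds latency (a record waits for its batch) but buys batch-engine integration and exactly-once semantics per batch.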
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in Non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
(Expected in 3.1 release)
• Cascading (http://www.cascading.org) is an application development platform for building data applications on Hadoop.
• Support for Apache Spark is on the roadmap and will be available in the Cascading 3.1 release. Source: http://www.cascading.org/new-fabric-support
• Spark-scalding is a library that aims to make the transition from Cascading/Scalding to Spark a little easier by adding support for Cascading Taps, Scalding Sources, and the Scalding Fields API in Spark. Source: http://scalding.io/2014/10/running-scalding-on-apache-spark
38
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines: https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark: https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014 – Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout scala and spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased): MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration: Service → Open Source Tool
• Storage/Serving Layer
• Data Formats
• Data Ingestion Services
• Resource Management
• Search
• SQL
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767) to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector.
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop
48
3 Integration
• There is also NSMC (Native Spark MongoDB Connector) for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
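The Kafka integration above rests on one core idea: Kafka is an append-only log, and each consumer group tracks a read offset, so a batch can be re-read deterministically after a failure. The toy below is not Kafka's API — `MiniLog`, `produce`, and `poll` are invented names — but it sketches that offset model in stdlib Python:

```python
class MiniLog:
    """Toy append-only log with per-consumer-group offsets.
    A conceptual sketch of the Kafka model, not Kafka's client API."""
    def __init__(self):
        self.records = []
        self.offsets = {}          # consumer group -> next offset to read

    def produce(self, record):
        self.records.append(record)

    def poll(self, group, max_records=2):
        # Read a bounded batch starting at the group's offset, then advance.
        # Because the log is immutable, the same offsets always yield the
        # same batch -- the basis for replay after a failure.
        start = self.offsets.get(group, 0)
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)
        return batch

log = MiniLog()
for msg in ["a", "b", "c"]:
    log.produce(msg)

print(log.poll("g1"))   # ['a', 'b']
print(log.poll("g1"))   # ['c']
```

Independent groups keep independent offsets, so a second consumer (`g2`) would start again from the beginning of the log.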
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
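The schema-inference step described above can be sketched in miniature. This is not Spark SQL's implementation — `infer_schema` is a hypothetical helper, handles only flat records, and crudely widens conflicting types to string — but it shows the basic pass Spark SQL makes over newline-delimited JSON to discover a column set and types before any query runs:

```python
import json

def infer_schema(json_lines):
    """Infer a flat field -> type-name mapping from newline-delimited JSON.
    A simplified sketch of Spark SQL's JSON schema inference."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            t = type(value).__name__
            if field in schema and schema[field] != t:
                # Conflicting types across records: widen to string.
                schema[field] = "str"
            else:
                schema.setdefault(field, t)
    return schema

lines = ['{"name": "ann", "age": 34}',
         '{"name": "bob", "age": 36, "city": "LA"}']
print(infer_schema(lines))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Note how `city`, present in only one record, still joins the schema — missing values simply become nulls at query time, which is why no DDL is needed up front.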
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
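The columnar idea behind Parquet can be shown in a few lines. This sketch is a drastic simplification (real Parquet adds nested-record shredding, encodings, and compression); the row data and names are invented, but the layout change is the point — storing each column contiguously so a query over one column never reads the others:

```python
# Row layout: one tuple per record, as a row-oriented format would store it.
rows = [("ann", 34, "LA"), ("bob", 36, "NY"), ("cy", 29, "LA")]
names = ("name", "age", "city")

# Column layout: one list per field, as a columnar format stores it.
columns = {name: [row[i] for row in rows] for i, name in enumerate(names)}

# A scan over a single column touches only that column's data, which is
# why columnar formats shine for analytical queries on wide tables.
avg_age = sum(columns["age"]) / len(columns["age"])
print(columns["city"])  # ['LA', 'NY', 'LA']
print(avg_age)          # 33.0
```

Same-typed values sitting together is also what makes the per-column encodings and compression mentioned above so effective.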
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines can:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a Non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store)
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
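The storage-agnosticism described in this slide boils down to one design choice: the computation is written against an abstract stream of records, not against a particular storage system. The sketch below is plain Python, not Spark's API — `word_count` and its inputs are invented for illustration — but it shows why the same logic can run unchanged over a local file, an S3 object, or a Kafka batch:

```python
import io

def word_count(lines):
    """Count words from any iterable of lines. The computation never asks
    where the lines come from -- that is the storage-agnostic contract."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

# The same logic over two different "storage backends":
from_memory = word_count(["spark and hadoop", "spark"])
from_filelike = word_count(io.StringIO("spark and hadoop\nspark"))

print(from_memory == from_filelike)  # True
print(from_memory["spark"])          # 2
```

In Spark the equivalent seam is the RDD: swapping HDFS for CassandraFS, Tachyon, or S3 changes only how the RDD is created, not the transformations applied to it.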
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
Apache Crunch
• The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. https://crunch.apache.org
• Apache Crunch 0.11 releases with a SparkPipeline class, making it easy to migrate data processing applications from MapReduce to Spark. https://crunch.apache.org/apidocs/0.11.0/org/apache/crunch/impl/spark/SparkPipeline.html
• Running Crunch with Spark: http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_ig_running_crunch_with_spark.html
39
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014 – Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations. http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
• Mahout Interactive Shell: interactive REPL shell for the Spark-optimized Mahout DSL. http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014: http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014: http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 features by engine (unreleased): MapReduce, Spark, H2O, Flink. http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 Integration
[Diagram: open source tools integrating with Spark, grouped by service layer – storage/serving layer, data formats, data ingestion services, resource management, search, SQL]
43
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data. https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0. https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need of using the Hadoop API anymore: Spark-HBase Connector. https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, and no timetable for possible support. http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra. https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface. http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra to Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the first resource negotiator).
• Integration is still improving; some open issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883). https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at the JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch; also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data: integrating Apache Spark Streaming and Elasticsearch. http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark already exists, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
                 Hadoop Ecosystem    Spark Ecosystem
Components:
                 HDFS                Tachyon
                 YARN                Mesos
Tools:
                 Pig                 Spark native API
                 Hive                Spark SQL
                 Mahout              MLlib
                 Storm               Spark Streaming
                 Giraph              GraphX
                 HUE                 Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria           YARN                         Mesos
Resource sharing   Yes                          Yes
Written in         Java                         C++
Scheduling         Memory only                  CPU and memory
Running tasks      Unix processes               Linux container groups
Requests           Specific requests and        More generic, but more coding
                   locality preference          for writing frameworks
Maturity           Less mature                  Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8, for much more concise lambda expressions, to get code nearly as simple as with the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                 Storm                    Spark Streaming
Processing model         Record at a time         Mini batches
Latency                  Sub-second               Few seconds
Fault tolerance          At least once            Exactly once
(every record            (may be duplicates)
processed)
Batch framework          Not available            Core Spark API
integration
Supported languages      Any programming          Scala, Java, Python
                         language
95
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
(Expected in Mahout 1.0)
• Mahout News, 25 April 2014 - Goodbye MapReduce: Apache Mahout, the original Machine Learning (ML) library for Hadoop since 2009, is rejecting new MapReduce algorithm implementations: http://mahout.apache.org
• Integration of Mahout and Spark:
• Reboot with a new Mahout Scala DSL for Distributed Machine Learning on Spark. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark
• Mahout Interactive Shell: an interactive REPL shell for the Spark-optimized Mahout DSL: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
40
(Expected in Mahout 1.0)
• Playing with Mahout's Spark Shell: https://mahout.apache.org/users/sparkbindings/play-with-shell.html
• Mahout Scala and Spark bindings, Dmitriy Lyubimov, April 2014:
  http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings
• Co-occurrence Based Recommendations with Mahout, Scala and Spark, published on May 30, 2014:
  http://www.slideshare.net/sscdotopen/cooccurrence-based-recommendations-with-mahout-scala-and-spark
• Mahout 1.0 Features by Engine (unreleased) - MapReduce, Spark, H2O, Flink: http://mahout.apache.org/users/basics/algorithms.html
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3. Integration
[Diagram: Hadoop ecosystem services and the open source tools Spark integrates with, by layer: Storage/Serving Layer, Data Formats, Data Ingestion Services, Resource Management, Search, SQL]
43
3. Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3
• Stronger integration between Spark and HDFS caching (SPARK-1767) will allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm/) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. The related HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3. Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code base: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integrating Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark back to Cassandra: http://tuplejump.github.io/calliope
• Using Cassandra as a storage backend with Spark is opening many new avenues
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3. Integration
• MongoDB is not directly supported by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files
48
3. Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015:
  http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015:
  http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014:
  http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN = Yet Another Resource Negotiator, an implicit reference to Mesos as "the" resource negotiator
• Integration is still improving, and some open issues are critical ones: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib
52
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON, which vastly simplifies the end-to-end experience of working with JSON data
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD is renamed to DataFrame
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
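Schema inference of the kind described above can be sketched with the standard library alone. infer_schema below is a hypothetical helper that loosely mirrors the idea (sample the records, union each field's observed types); it is not Spark SQL's actual implementation.

```python
import json

# Hypothetical sketch of JSON schema inference: walk a sample of JSON
# records and record the set of types observed for each field.
def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    # A field seen with several types keeps the union of those types.
    return {field: sorted(types) for field, types in schema.items()}

records = [
    '{"name": "spark", "stars": 4000}',
    '{"name": "hadoop", "stars": 9000, "batch_only": true}',
]
print(infer_schema(records))
# e.g. {'name': ['str'], 'stars': ['int'], 'batch_only': ['bool']}
```

Fields missing from some records (like batch_only above) simply become nullable columns in the inferred schema.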
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
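Why columnar? A toy sketch (plain Python; to_columns is an illustrative helper, not any Parquet API) shows the core idea behind the format above: storing values column-by-column lets a query touch only the columns it needs.

```python
# Toy illustration of row-oriented vs. columnar layout (the idea behind Parquet).
rows = [
    {"name": "spark",  "year": 2010, "lang": "Scala"},
    {"name": "hadoop", "year": 2006, "lang": "Java"},
    {"name": "drill",  "year": 2012, "lang": "Java"},
]

def to_columns(rows):
    """Pivot row-oriented records into one contiguous array per column."""
    return {key: [r[key] for r in rows] for key in rows[0]}

columns = to_columns(rows)
# A query like SELECT avg(year) now reads a single column,
# instead of scanning every full row:
avg_year = sum(columns["year"]) / len(columns["year"])
print(avg_year)  # 2009.333...
```

Columnar layout also compresses far better, since each column holds values of one type (note the repeated "Java" values above).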
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
58
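The "compact binary format" point can be made concrete with the standard library alone. This uses Python's struct module as a stand-in for the idea; a real pipeline would use an Avro library (e.g. avro or fastavro), where the schema likewise lives once per file rather than once per record.

```python
import json
import struct

# Stand-in demonstration of why binary encodings are compact:
# the same record as JSON text vs. a fixed binary layout.
record = {"user_id": 123456, "score": 0.75, "active": True}

text = json.dumps(record).encode("utf-8")        # field names repeated per record
binary = struct.pack(                            # schema lives outside the data,
    "<Qd?",                                      # as it does in an Avro file:
    record["user_id"],                           # Q = unsigned 64-bit int
    record["score"],                             # d = 64-bit float
    record["active"],                            # ? = bool
)

print(len(text), len(binary))  # the binary form is a fraction of the JSON size
assert struct.unpack("<Qd?", binary) == (123456, 0.75, True)
```

Multiplied over billions of records, that size difference is the difference between a data set fitting in cluster memory or not.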
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Any RDD can also be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrates ingestion of HDFS data into Solr from MapReduce to Spark
  • Updates and deletes existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them
64
[Diagram: Hadoop ecosystem + Spark ecosystem]
4. Complementarity: Spark + Tachyon + HDFS
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity: YARN + Mesos
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services"
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity: YARN + Mesos
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or... HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4. Complementarity: Spark + Tez
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group:
  http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015:
  http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015:
  http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS)
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
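What makes the storage choices above interchangeable is that Spark selects a backend from the path's URL scheme. A small stdlib sketch of that convention (the scheme names are the ones Spark documents; route_path itself is a hypothetical helper, not a Spark function):

```python
from urllib.parse import urlparse

# Illustration of scheme-based storage routing, the convention Spark
# follows for paths like file://, hdfs://, s3n://, swift://, tachyon://.
BACKENDS = {
    "file": "local file system",
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "swift": "OpenStack Swift",
    "tachyon": "Tachyon in-memory FS",
}

def route_path(path):
    scheme = urlparse(path).scheme or "file"  # bare paths default to local
    try:
        return BACKENDS[scheme]
    except KeyError:
        raise ValueError("no backend registered for scheme: " + scheme)

print(route_path("s3n://my-bucket/logs/2015/"))  # Amazon S3
print(route_path("/tmp/local-data.txt"))         # local file system
```

Because application code only passes paths, swapping HDFS for S3 or Tachyon is usually a configuration change, not a code change.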
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
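Each deployment mode above is selected through the master URL handed to Spark at startup. A stdlib sketch of that dispatch (the URL forms — local[*], spark://host:port, mesos://, yarn-client/yarn-cluster — are Spark's documented conventions of the era; classify_master itself is a hypothetical helper):

```python
import re

# Hypothetical classifier for Spark master URLs; the URL forms are the
# documented conventions Spark uses to pick a deployment mode.
def classify_master(master):
    if re.fullmatch(r"local(\[(\d+|\*)\])?", master):
        return "local"                      # single JVM, e.g. local[4] or local[*]
    if master.startswith("spark://"):
        return "standalone"                 # Spark's built-in cluster manager
    if master.startswith("mesos://"):
        return "mesos"                      # Apache Mesos cluster
    if master in ("yarn-client", "yarn-cluster"):
        return "yarn"                       # inside a Hadoop YARN cluster
    raise ValueError("unrecognized master URL: " + master)

print(classify_master("local[*]"))           # local
print(classify_master("spark://host:7077"))  # standalone
print(classify_master("yarn-client"))        # yarn
```

The same application jar runs under any of these; only the master URL (and cluster-side configuration) changes between a laptop and a datacenter.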
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters - with the data and analytical tools that your data scientists need - in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives

Hadoop ecosystem | Spark ecosystem
Components:
HDFS             | Tachyon
YARN             | Mesos
Tools:
Pig              | Spark native API
Hive             | Spark SQL
Mahout           | MLlib
Storm            | Spark Streaming
Giraph           | GraphX
HUE              | Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
(Expected in Mahout 10 )
bull Playing with Mahouts Spark Shellhttpsmahoutapacheorguserssparkbindingsplay-with-shellhtml
bull Mahout scala and spark bindings Dmitriy Lyubimov
April 2014
httpwwwslidesharenetDmitriyLyubimovmahout-scala-and-spark-bindings
bull Co-occurrence Based Recommendations with
Mahout Scala and Spark Published on May 30 2014
httpwwwslidesharenetsscdotopencooccurrence-based-recommendations-
with-mahout-scala-and-spark
bull Mahout 10 Features by Engine (unreleased)-
MapReduce Spark H2O Flinkhttpmahoutapacheorgusersbasicsalgorithmshtml
41
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 IntegrationService Open Source Tool
StorageServi
ng Layer
Data Formats
Data
Ingestion
Services
Resource
Management
Search
SQL
43
3 Integration
bull Spark was designed to read and write data from and toHDFS as well as other storage systems supported byHadoop API such as your local file system Hive HBaseCassandra and Amazonrsquos S3
bull Stronger integration between Spark and HDFS caching(SPARK-1767) to allow multiple tenants and processingframeworks to share the same in-memoryhttpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM Discardable Distributed Memoryhttphortonworkscomblogddm to store RDDs in memoryThisallows many Spark applications to share RDDs since theyare now resident outside the address space of theapplication Related HDFS-5851 is planned for Hadoop30 httpsissuesapacheorgjirabrowseHDFS-5851
44
3 Integration
bull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via
newAPIHadoopRDD Example HBaseTestscala from
Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapach
esparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available
for reading from and writing to HBase without the need
of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with
Spark Status Still in experimentation and no timetable for
possible support httpblogclouderacomblog201412new-in-cloudera-
labs-sparkonhbase
45
3 Integration
bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark
RDDs to Cassandra tables and execute arbitrary CQL
queries in your Spark applications Supports also
integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration
is not based on the Cassandras Hadoop interfacehttpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag20-cassandra
46
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra. http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via an official Mongo-Hadoop connector. https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the original resource negotiator).
• The integration is still improving, and some open issues are critical ones. JIRA search: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC, at https://issues.apache.org/jira/issues/
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883). https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration. http://drill.apache.org/
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system. http://kafka.apache.org/
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org/
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
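The idea behind the schema inference described above can be illustrated without Spark at all. The toy sketch below (plain Python, not Spark's actual implementation) scans JSON records and derives a field-to-type mapping, widening a field to a string type when records disagree, which is roughly what Spark SQL does when it loads JSON as a SchemaRDD:

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: scan every record and union the field types,
    the way Spark SQL derives a schema from a JSON dataset."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            t = type(value).__name__
            # A field seen with conflicting types widens to 'string' (simplified).
            if schema.get(field, t) != t:
                schema[field] = "string"
            else:
                schema[field] = t
    return schema

lines = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "age": "unknown", "city": "LA"}',
]
print(infer_schema(lines))
# {'name': 'str', 'age': 'string', 'city': 'str'}
```

Note how no DDL is needed: the schema falls out of a single pass over the data, which is why "just point Spark SQL at JSON files and query" works.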
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org/
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files
  http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
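Why a columnar format like Parquet pays off for analytical queries can be sketched in a few lines of plain Python (a conceptual illustration only; real Parquet adds per-column encoding and compression on top of this layout): a column scan touches only the values it needs, instead of deserializing whole rows.

```python
# Row-oriented vs. column-oriented layout of the same small table.
rows = [
    {"user": "a", "clicks": 10, "country": "US"},
    {"user": "b", "clicks": 3,  "country": "FR"},
    {"user": "c", "clicks": 7,  "country": "US"},
]

# Columnar layout: one contiguous list per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# An aggregate over one column now reads len(rows) values instead of
# len(rows) * num_columns values, which is the core columnar win.
total_clicks = sum(columns["clicks"])
print(total_clicks)  # 20
```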
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org/
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com/
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
(Diagram: Hadoop ecosystem alongside Spark ecosystem)
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big Data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
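The "building the DAG" point above is the heart of both Tez and Spark execution: stages are scheduled in dependency order. A minimal, framework-free sketch of that idea (with hypothetical ETL stage names, not any real Tez or Spark API) is a topological sort over stage dependencies:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical ETL stages mapped to their upstream dependencies.
dag = {
    "extract": set(),
    "filter": {"extract"},
    "join": {"extract", "filter"},
    "aggregate": {"join"},
}

# A DAG engine runs each stage only after everything it depends on is done.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Real engines go further (they also use data-distribution statistics to pick join strategies and task placement, which is the optimization the slide alludes to), but dependency-ordered stage execution is the common core.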
4. Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer): http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
75
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, by Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, by Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com/
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology/) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com/) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com/) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
86
4. Alternatives
Hadoop Ecosystem          Spark Ecosystem
Components:
  HDFS                    Tachyon
  YARN                    Mesos
Tools:
  Pig                     Spark native API
  Hive                    Spark SQL
  Mahout                  MLlib
  Storm                   Spark Streaming
  Giraph                  GraphX
  HUE                     Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org/
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org/) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria             YARN                              Mesos
Resource sharing     Yes                               Yes
Written in           Java                              C++
Scheduling           Memory only                       CPU and Memory
Running tasks        Unix processes                    Linux Container groups
Requests             Specific requests and             More generic, but more coding
                     locality preference               for writing frameworks
Maturity             Less mature                       Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8's much more concise lambda expressions, to get code nearly as simple as the Scala API.
• ETL with Spark: First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
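The flavor of the native RDD API can be sketched without a cluster. The classic word count pipeline in Spark is a chain like textFile → flatMap → map → reduceByKey; the plain-Python sketch below (the helper functions are made up for illustration, not Spark APIs) mimics those three transformations on a local list:

```python
def flat_map(f, xs):
    """Mimics RDD.flatMap: apply f to each element and flatten the results."""
    return [y for x in xs for y in f(x)]

def reduce_by_key(f, pairs):
    """Mimics RDD.reduceByKey: merge the values for each key using f."""
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return sorted(acc.items())

lines = ["spark or hadoop", "spark and hadoop"]
words = flat_map(str.split, lines)                 # flatMap
pairs = [(w, 1) for w in words]                    # map
counts = reduce_by_key(lambda a, b: a + b, pairs)  # reduceByKey
print(counts)
# [('and', 1), ('hadoop', 2), ('or', 1), ('spark', 2)]
```

In actual Spark, the same logic distributes across a cluster with no structural change to the code, which is a big part of the API's appeal.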
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
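The "mix and match SQL and imperative code" workflow described above (register a dataset, query it with SQL, then keep processing the result programmatically) can be sketched with stdlib sqlite3 standing in for a Spark SQLContext. This is only an analogy for the programming pattern, not Spark SQL itself, and the table and column names are made up:

```python
import sqlite3

rows = [("alice", 34), ("bob", 36), ("carol", 29)]

# Register the data so it is queryable by name (Spark SQL: registerTempTable).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

# Declarative step: plain SQL over the registered data.
adults = conn.execute(
    "SELECT name FROM people WHERE age > 30 ORDER BY name").fetchall()

# Imperative step: ordinary code over the query result.
names = [name.upper() for (name,) in adults]
print(names)  # ['ALICE', 'BOB']
```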
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                  Storm                     Spark Streaming
Processing model          Record at a time          Mini batches
Latency                   Sub-second                Few seconds
Fault tolerance           At least once             Exactly once
(every record processed)  (may be duplicates)
Batch framework           Not available             Core Spark API
integration
Supported languages       Any programming           Scala, Java,
                          language                  Python
95
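The "mini batches" row in the table is the key design difference: Spark Streaming discretizes a stream into small batches and runs a regular batch job on each one, which is what makes exactly-once semantics and reuse of the Core Spark API natural. A toy sketch of the discretization step in plain Python (not the DStream implementation, just the idea):

```python
def micro_batches(stream, batch_interval):
    """Group a timestamped stream into mini batches, DStream-style:
    each batch covers one interval and is then processed as a whole."""
    batches = {}
    for timestamp, record in stream:
        window = timestamp // batch_interval
        batches.setdefault(window, []).append(record)
    return [batches[w] for w in sorted(batches)]

# (timestamp_ms, record) events, with a 1000 ms batch interval.
events = [(100, "a"), (900, "b"), (1200, "c"), (2500, "d")]
for batch in micro_batches(events, 1000):
    print(batch)  # ['a', 'b'] then ['c'] then ['d']
```

Storm's record-at-a-time model processes each event as it arrives instead, trading the batch-level guarantees for sub-second latency.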
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org/) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
42
3 IntegrationService Open Source Tool
StorageServi
ng Layer
Data Formats
Data
Ingestion
Services
Resource
Management
Search
SQL
43
3 Integration
bull Spark was designed to read and write data from and toHDFS as well as other storage systems supported byHadoop API such as your local file system Hive HBaseCassandra and Amazonrsquos S3
bull Stronger integration between Spark and HDFS caching(SPARK-1767) to allow multiple tenants and processingframeworks to share the same in-memoryhttpsissuesapacheorgjirabrowseSPARK-1767
bull Use DDM Discardable Distributed Memoryhttphortonworkscomblogddm to store RDDs in memoryThisallows many Spark applications to share RDDs since theyare now resident outside the address space of theapplication Related HDFS-5851 is planned for Hadoop30 httpsissuesapacheorgjirabrowseHDFS-5851
44
3 Integration
bull Out of the box Spark can interface with HBase as it has
full support for Hadoop InputFormats via
newAPIHadoopRDD Example HBaseTestscala from
Spark Code httpsgithubcomapachesparkblobmasterexamplessrcmainscalaorgapach
esparkexamplesHBaseTestscala
bull There are also Spark RDD implementations available
for reading from and writing to HBase without the need
of using Hadoop API anymore Spark-HBase Connector httpsgithubcomnerdammerspark-hbase-connector
bull SparkOnHBase is a project for HBase integration with
Spark Status Still in experimentation and no timetable for
possible support httpblogclouderacomblog201412new-in-cloudera-
labs-sparkonhbase
45
3 Integration
bull Spark Cassandra Connector This library lets you
expose Cassandra tables as Spark RDDs write Spark
RDDs to Cassandra tables and execute arbitrary CQL
queries in your Spark applications Supports also
integration of Spark Streaming with Cassandrahttpsgithubcomdatastaxspark-cassandra-connector
bull Spark + Cassandra using Deep The integration
is not based on the Cassandras Hadoop interfacehttpstratiogithubiodeep-spark
bull Getting Started with Apache Spark and Cassandrahttpplanetcassandraorggetting-started-with-apache-spark-and-cassandra
bull lsquoCassandrarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag20-cassandra
46
3 Integration
bull Benchmark of Spark amp Cassandra Integration
using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume
data from Cassandra to spark and store Resilient
Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new
avenues
bull Kindling An Introduction to Spark with Cassandra
(Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-
spark-with-cassandra
47
3 Integration
bull MongoDB is not directly served by Spark although
it can be used from Spark via an official Mongo-
Hadoop connector
bull MongoDB-Spark Demohttpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-
insights
bull Spark SQL also provides indirect support via its
support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
48
3 Integration
bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from
Apache Spark (still experimental)
bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-
introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-
example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-
example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without
Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
49
3 Integration
bull Neo4j is a highly scalable robust (fully ACID) native graph
database
bull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015
httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015
httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph
Analytics By Kenny Bastani November 3 2014
httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
50
3 Integration YARN
bull YARN Yet Another Resource Negotiator Implicit
reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND
20summary20~20yarn20AND20status203D20OPEN20ORDER20
BY20priority20DESC0A
bull Some issues are critical ones
bull Running Spark on YARNhttpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
51
3 Integration
bull Spark SQL provides built in support for Hivetables
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 013 is supported in Spark 120
bull Support of ORCFile (Optimized Row Columnarfile) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries andfor fetching dataset machine learning algorithmsin MLlib
52
3 Integration
bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to
address new use cases
bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query
in-memory data in Spark Embed Drill execution in a
Spark data pipeline
Source Whats Coming in 2015 for
Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
53
3 Integration
bull Apache Kafka is a high throughput distributed
messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka
Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming
Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-
example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka
54
3 Integration
bull Apache Flume is a streaming event data
ingestion system that is designed for Big Data
ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with
Flume There are two approaches to this
bull Approach 1 Flume-style Push-based Approach
bull Approach 2 (Experimental) Pull-based
Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
55
3 Integration
bull Spark SQL provides built in support for JSON that
is vastly simplifying the end-to-end-experience of
working with JSON data
bull Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD No more DDL Just point Spark
SQL to JSON files and query Starting Spark 13
SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-
support-in-spark-sqlhtml
56
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
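The benefit of a columnar format can be shown with a toy, pure-Python row-to-column conversion; this illustrates only the layout idea, not the Parquet encoding itself:

```python
def to_columnar(rows):
    """Toy row-to-columnar conversion: the essence of formats like
    Parquet is that each column is stored contiguously, so a query
    reading one column never touches the bytes of the others."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

rows = [{"user": "a", "clicks": 3}, {"user": "b", "clicks": 7}]
cols = to_columnar(rows)
total_clicks = sum(cols["clicks"])  # scans a single column only
```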
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL; this library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
58
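The "compact binary format" point can be illustrated with Python's stdlib struct module: once a schema is agreed out of band, a record needs no field names on the wire. This shows the idea behind Avro's encoding, not the Avro format itself:

```python
import json
import struct

record = {"sensor_id": 12, "temperature": 21.5}

# Self-describing text: field names are repeated in every record.
json_bytes = json.dumps(record).encode("utf-8")

# Schema-driven binary: one unsigned 32-bit int plus one 64-bit double,
# little-endian, 12 bytes total, no field names needed.
binary_bytes = struct.pack("<Id", record["sensor_id"],
                           record["temperature"])

# The reader recovers the fields because it knows the same schema.
sensor_id, temperature = struct.unpack("<Id", binary_bytes)
```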
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read from and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support was added in elasticsearch-hadoop 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool, "CrunchIndexerTool on Spark", for fast and easy indexing, ingestion, and serving of searchable complex data.
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity +
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store): https://spark.apache.org/docs/latest/storage-openstack-swift.html and https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
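In code, storage agnosticism mostly comes down to dispatching on the URI scheme of a path instead of assuming HDFS. A toy, pure-Python illustration (the mapping below is hypothetical, not Spark's actual resolution logic):

```python
from urllib.parse import urlparse

# Illustrative scheme-to-backend mapping; Spark resolves schemes via
# the pluggable Hadoop FileSystem API rather than a table like this.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "tachyon": "Tachyon in-memory file system",
    "file": "local file system",
}

def resolve_backend(path):
    scheme = urlparse(path).scheme or "file"  # bare paths are local
    return BACKENDS.get(scheme, "unknown backend")

backend = resolve_backend("s3n://my-bucket/logs/2015/03/12.gz")
local = resolve_backend("/tmp/data.txt")
```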
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
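In practice, the deployment choice mostly surfaces as the --master URL handed to spark-submit. A sketch using Spark 1.x syntax; the host names and app.py are placeholders:

```shell
# Local mode: one JVM, as many worker threads as cores
spark-submit --master "local[*]" app.py

# Standalone cluster manager (host and port are placeholders)
spark-submit --master spark://master-host:7077 app.py

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 app.py

# Hadoop YARN: the cluster location comes from the Hadoop configuration
spark-submit --master yarn-cluster app.py
```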
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE: DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives
            Hadoop ecosystem    Spark ecosystem
Component   HDFS                Tachyon
            YARN                Mesos
Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
  • Share the datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
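The fine-grained sharing described above works through Mesos' resource-offer loop: the master offers idle capacity, and a framework accepts only what its runnable tasks need, leaving the rest for others. A toy, pure-Python illustration (real Mesos offers cover CPU, memory, ports, and more):

```python
def make_offer(total_cpus, used_cpus):
    """The master offers whatever capacity is currently idle."""
    return total_cpus - used_cpus

def spark_accepts(offered_cpus, pending_tasks):
    """A framework accepts one CPU per runnable task and declines
    the rest of the offer, so other frameworks can use it."""
    return min(offered_cpus, pending_tasks)

idle = make_offer(total_cpus=16, used_cpus=10)
accepted = spark_accepts(idle, pending_tasks=4)
declined = idle - accepted  # stays available to other frameworks
```

Because offers repeat continuously, a long-running Spark job can soak up slack capacity mid-execution and shrink again when other workloads arrive.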
YARN vs. Mesos
Criteria          YARN                    Mesos
Resource sharing  Yes                     Yes
Written in        Java                    C++
Scheduling        Memory only             CPU and memory
Running tasks     Unix processes          Linux container groups
Requests          Specific requests and   More generic, but more coding
                  locality preference     for writing frameworks
Maturity          Less mature             Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
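The flavor of the native RDD API can be shown with a toy, single-machine stand-in: the chaining style below mirrors Spark's classic word count, but the class is illustrative only, not Spark itself:

```python
class LocalRDD:
    """Toy single-machine stand-in for the RDD API surface."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return LocalRDD(f(x) for x in self.data)

    def flatMap(self, f):
        return LocalRDD(y for x in self.data for y in f(x))

    def reduceByKey(self, f):
        acc = {}
        for key, value in self.data:
            acc[key] = f(acc[key], value) if key in acc else value
        return LocalRDD(acc.items())

    def collect(self):
        return list(self.data)

# The word-count chain reads the same way in PySpark or Scala.
counts = dict(LocalRDD(["spark and hadoop", "spark streaming"])
              .flatMap(str.split)
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b)
              .collect())
```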
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
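The "mix and match SQL with imperative APIs" point can be illustrated with the stdlib sqlite3 engine standing in for Spark SQL: aggregate with SQL, then keep processing the result in ordinary code. A sketch of the workflow, not the Spark SQL API:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10.0), ("b", 4.0), ("a", 6.0)])

# Declarative half: let the SQL engine do the aggregation.
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative half: post-process the SQL result in the host language.
big_spenders = [user for user, total in rows if total > 5.0]
```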
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                  Storm                  Spark Streaming
Processing model          Record at a time       Mini batches
Latency                   Sub-second             Few seconds
Fault tolerance           At least once          Exactly once
(every record processed)  (may be duplicates)
Batch framework           Not available          Core Spark API
integration
Supported languages       Any programming        Scala, Java, Python
                          language
95
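The two processing models in the table can be sketched in a few lines of pure Python: mini-batching groups records by arrival time into small batches, which is what buys Spark Streaming its batch-style fault tolerance and Core Spark API integration at the cost of a few seconds of latency. A toy illustration:

```python
def to_mini_batches(events, batch_interval):
    """Group (timestamp_seconds, payload) events, assumed sorted by
    time, into fixed-width arrival windows, the way a mini-batch
    streaming engine does; a record-at-a-time engine would instead
    hand each payload to user code immediately."""
    batches = {}
    for ts, payload in events:
        window = int(ts // batch_interval)
        batches.setdefault(window, []).append(payload)
    return [batches[w] for w in sorted(batches)]

events = [(0.2, "a"), (0.9, "b"), (2.5, "c"), (3.1, "d")]
batches = to_mini_batches(events, batch_interval=2)
# Two 2-second batches: ["a", "b"] and ["c", "d"]
```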
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
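As a reminder of what GraphX computes, here is a toy PageRank over an adjacency list in pure Python; a minimal sketch of the algorithm, not the GraphX Pregel API:

```python
def pagerank(links, iterations=50, damping=0.85):
    """links: dict mapping each node to the list of nodes it links to."""
    nodes = set(links) | {n for outs in links.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, outs in links.items():
            if outs:
                share = rank[src] / len(outs)  # split rank over out-links
                for dst in outs:
                    incoming[dst] += share
        rank = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                for n in nodes}
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
top = max(ranks, key=ranks.get)  # "c" collects the most rank here
```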
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics and has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage!
2. Deployment: Spark is cluster infrastructure agnostic. Choose your own deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research the pros and cons, before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
bull Use Drill to query Spark RDDs Use BI tools to query
in-memory data in Spark Embed Drill execution in a
Spark data pipeline
Source Whats Coming in 2015 for
Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
53
3 Integration
bull Apache Kafka is a high throughput distributed
messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka
Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming
Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-
example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka
54
3 Integration
bull Apache Flume is a streaming event data
ingestion system that is designed for Big Data
ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with
Flume There are two approaches to this
bull Approach 1 Flume-style Push-based Approach
bull Approach 2 (Experimental) Pull-based
Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
55
3 Integration
bull Spark SQL provides built in support for JSON that
is vastly simplifying the end-to-end-experience of
working with JSON data
bull Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD No more DDL Just point Spark
SQL to JSON files and query Starting Spark 13
SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-
support-in-spark-sqlhtml
56
3 Integration
bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows to
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
57
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
bull Problem
bull Various inbound data sets
bull Data Layout can change without notice
bull New data sets can be added without notice
Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
bull The Kite SDK provides high level abstractions to
work with datasets on Hadoop hiding many of
the details of compression codecs file formats
partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016
release so Spark jobs can read and write to Kite
datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
59
3 Integration
bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark in the form of RDD that can
read data from Elasticsearch Also any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html
60
3 Integration
bull Apache Solr added a Spark-based indexing tool for
fast and easy indexing ingestion and serving
searchable complex data ldquoCrunchIndexerTool on
Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN
cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache
Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-
resource-management
bull YARN vs MESOS Canrsquot We All Just Get
Along httpstrataconfcombig-data-conference-ca-
2015publicscheduledetail40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
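The rule of thumb above can be captured as a toy helper function (purely illustrative; the function name, units, and threshold are my own, not part of Tez or Spark):

```python
def choose_engine(data_size_gb, cluster_ram_gb):
    """Toy heuristic from the slide: favor Tez when the data dwarfs
    cluster RAM (stream-oriented, mature shuffle, close YARN ties),
    and Spark when the parsed data fits in memory for caching."""
    return "tez" if data_size_gb > cluster_ram_gb else "spark"
```

In practice the decision also depends on workload shape (iterative vs. one-pass), but the data-to-RAM ratio is the first question to ask.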
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer's Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5 Key Takeaways
1 Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system such as Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 Use OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
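The storage options above all plug in through the URI scheme of the path handed to Spark. Conceptually, that dispatch looks like the sketch below (plain Python; the mapping and reader names are illustrative of Hadoop FileSystem implementations, not Spark internals):

```python
from urllib.parse import urlparse

# Illustrative mapping of URI schemes to backing file-system classes;
# Spark resolves these through the Hadoop FileSystem API, not a dict.
FILESYSTEMS = {
    "file": "LocalFileSystem",
    "hdfs": "DistributedFileSystem",
    "s3n": "NativeS3FileSystem",
    "tachyon": "TachyonFileSystem",
}

def resolve_filesystem(path):
    """Pick the storage backend from the URI scheme; bare paths
    default to the local file system."""
    scheme = urlparse(path).scheme or "file"
    return FILESYSTEMS.get(scheme, "unknown scheme: " + scheme)
```

The point of the sketch: swapping HDFS for S3 or Tachyon changes only the path's scheme, not the processing code.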
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, by Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, by Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
            Hadoop ecosystem    Spark ecosystem
Components  HDFS                Tachyon
            YARN                Mesos
Tools       Pig                 Spark native API
            Hive                Spark SQL
            Mahout              MLlib
            Storm               Spark Streaming
            Giraph              GraphX
            HUE                 Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria          YARN                       Mesos
Resource sharing  Yes                        Yes
Written in        Java                       C++
Scheduling        Memory only                CPU and memory
Running tasks     Unix processes             Linux container groups
Requests          Specific requests and      More generic, but more coding
                  locality preference        for writing frameworks
Maturity          Less mature                Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shells in Scala and Python.
• Spark supports Java 8 lambda expressions, making code nearly as concise as the Scala API.
• ETL with Spark, First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
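The flavor of the native API can be shown without a cluster: below is a word count written with the same functional primitives, in plain Python standing in for RDD operations (in Spark, the list would be an RDD built via sc.textFile and the final loop a reduceByKey):

```python
lines = ["to be or not to be", "to live"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split()]
# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]
# reduceByKey: sum the counts per word
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n
# counts == {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'live': 1}
```

The same shape holds in Scala, Java 8, or Python on a real cluster; only the data container changes from a local list to a distributed RDD.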
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                  Storm                  Spark Streaming
Processing model          Record at a time       Mini-batches
Latency                   Sub-second             Few seconds
Fault tolerance (every    At least once (may     Exactly once
record processed)         be duplicates)
Batch framework           Not available          Core Spark API
integration
Supported languages       Any programming        Scala, Java, Python
                          language
95
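The mini-batch model in the table can be sketched in a few lines: records are bucketed into fixed-length time intervals, and each bucket is processed as one small batch (a plain-Python illustration of the idea, not Spark Streaming's actual implementation):

```python
def to_mini_batches(records, batch_interval):
    """Bucket (timestamp, value) records into fixed-length batches,
    the way a micro-batch engine slices a continuous stream."""
    batches = {}
    for ts, value in records:
        batch_id = int(ts // batch_interval)  # which interval the record falls in
        batches.setdefault(batch_id, []).append(value)
    # Each batch is then handed to the regular (batch) processing engine.
    return [batches[k] for k in sorted(batches)]
```

For example, with a 1-second interval, records stamped 0.1, 0.9, 1.2, and 2.5 come out as three batches. This is why latency is "a few seconds" (a record waits for its batch to close) while integration with the core batch API comes for free.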
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics and has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Integration
• Spark was designed to read and write data from and to HDFS, as well as other storage systems supported by the Hadoop API, such as your local file system, Hive, HBase, Cassandra, and Amazon's S3.
• Stronger integration between Spark and HDFS caching (SPARK-1767), to allow multiple tenants and processing frameworks to share the same in-memory data: https://issues.apache.org/jira/browse/SPARK-1767
• Use DDM (Discardable Distributed Memory, http://hortonworks.com/blog/ddm) to store RDDs in memory. This allows many Spark applications to share RDDs, since they are now resident outside the address space of the application. Related: HDFS-5851 is planned for Hadoop 3.0: https://issues.apache.org/jira/browse/HDFS-5851
44
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without the need to use the Hadoop API anymore: the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still experimental, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: this integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• The Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving, and some open issues are critical ones: https://issues.apache.org/jira/issues/ with JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables.
  • Run SQL queries over imported data.
  • Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
• Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka: Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach.
  • Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
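What "automatically infer the schema" means can be illustrated with a toy inference pass over JSON lines (plain Python; a crude stand-in for what Spark SQL does, including widening conflicting field types to a common one):

```python
import json

def infer_schema(json_lines):
    """Build a flat {field: type-name} schema from JSON records,
    falling back to 'str' when a field's type varies across records."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            t = type(value).__name__
            if field in schema and schema[field] != t:
                schema[field] = "str"   # widen to a common type on conflict
            else:
                schema.setdefault(field, t)
    return schema
```

Given '{"name": "a", "age": 3}' and '{"name": "b", "age": "old"}', the age field widens from int to str; the real engine does the same kind of reconciliation (plus nesting, arrays, and null handling) so that no DDL is ever written by hand.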
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files.
  • Run SQL queries over imported data.
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
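The row-versus-columnar idea behind Parquet reduces to a small pivot (a conceptual sketch only; the real format adds encodings, compression, and nested schemas):

```python
def to_columnar(rows):
    """Pivot row-oriented records into per-field columns: the core of a
    columnar layout, which lets a query scan only the columns it needs."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns
```

For example, to_columnar([{"a": 1, "b": 2}, {"a": 3, "b": 4}]) returns {"a": [1, 3], "b": [2, 4]}; a query touching only column "a" never reads the values of "b", which is where the columnar I/O savings come from.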
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL (requires Spark 1.2+): https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• An Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem: various inbound data sets; the data layout can change without notice; new data sets can be added without notice.
  • Result: leverage Spark to dynamically split the data; leverage Avro to store the data in a compact binary format.
58
3 Integration Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
  • Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN
cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache
Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-
resource-management
bull YARN vs MESOS Canrsquot We All Just Get
Along httpstrataconfcombig-data-conference-ca-
2015publicscheduledetail40620
67
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the
need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with
YARN (resource chaining in clusters)
bull Tez supports enterprise security
68
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
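The master URL passed to SparkConf or spark-submit is what selects one of the cluster managers listed above. As a plain-Python illustration (no Spark installation needed), here is a hedged sketch mapping each deployment mode to its Spark 1.x-era master URL format; the host names and ports are placeholders, not real endpoints:

```python
# Master URL formats for the deployment modes listed above. The scheme
# prefix tells Spark which cluster manager to contact; "local[N]" runs
# everything in a single JVM with N worker threads.
MASTER_URLS = {
    "local": "local[*]",                       # all cores on one machine
    "standalone": "spark://master-host:7077",  # Spark's built-in manager
    "mesos": "mesos://mesos-host:5050",        # Apache Mesos
    "yarn": "yarn-client",                     # Hadoop YARN (Spark 1.x syntax)
}

def master_url(mode: str) -> str:
    """Return the master URL string for a known deployment mode."""
    try:
        return MASTER_URLS[mode]
    except KeyError:
        raise ValueError("unknown deployment mode: %s" % mode)

print(master_url("mesos"))  # prints mesos://mesos-host:5050
```

In real code this string would be passed as `SparkConf().setMaster(...)` or `spark-submit --master ...`; the dictionary above only illustrates the URL shapes.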
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Hadoop Ecosystem | Spark Ecosystem
--- Component ---
HDFS             | Tachyon
YARN             | Mesos
--- Tools ---
Pig              | Spark native API
Hive             | Spark SQL
Mahout           | MLlib
Storm            | Spark Streaming
Giraph           | GraphX
HUE              | Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, YARN, HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos
Criteria         | YARN                                      | Mesos
Resource sharing | Yes                                       | Yes
Written in       | Java                                      | C++
Scheduling       | Memory only                               | CPU and memory
Running tasks    | Unix processes                            | Linux container groups
Requests         | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity         | Less mature                               | Relatively more mature
90
Spark Native API
• Spark's native API is available in Scala, Java, and Python.
• Interactive shells are available in Scala and Python.
• Spark supports Java 8 lambda expressions, for much more concise code, nearly as simple as the Scala API.
• "ETL with Spark" - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
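The concise flatMap → map → reduceByKey pipeline that Spark's lambda-friendly API is known for can be mimicked, lambda for lambda, in plain Python. This is a sketch of the style only, not Spark code; it runs without any cluster:

```python
from functools import reduce
from itertools import groupby

lines = ["spark or hadoop", "spark and hadoop"]

# The classic Spark word count, transliterated into Python built-ins:
words = [w for line in lines for w in line.split()]          # flatMap
pairs = [(w, 1) for w in words]                              # map to (word, 1)
counts = {
    key: reduce(lambda a, b: a + b, (v for _, v in group))   # reduceByKey
    for key, group in groupby(sorted(pairs), key=lambda p: p[0])
}
print(counts)  # {'and': 1, 'hadoop': 2, 'or': 1, 'spark': 2}
```

Against a real SparkContext the same logic would be `sc.textFile(...).flatMap(...).map(...).reduceByKey(...)`, with the lambdas unchanged in spirit.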
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
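The "mix and match SQL and imperative code" workflow described above can be illustrated without a cluster. The sketch below uses Python's in-memory SQLite as a stand-in for Spark SQL (an assumption made purely so the example runs anywhere; the API shown is sqlite3, not Spark's):

```python
import sqlite3

# Declarative step: express the relational part of the analysis in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 5), ("ann", 4)])
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: continue on the SQL result in ordinary code,
# the way a Spark program would keep transforming the resulting RDD.
top = max(rows, key=lambda r: r[1])
print(rows)  # [('ann', 7), ('bob', 5)]
print(top)   # ('ann', 7)
```

In Spark SQL the handoff is the same shape: a `sql(...)` call returns an RDD (later a DataFrame) that the rest of the program processes imperatively.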
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming
Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini-batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
95
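The first row of the table, record-at-a-time versus mini-batches, can be made concrete with a small pure-Python sketch (the batch interval and record timestamps are made up for illustration):

```python
# Contrast the two processing models: Storm hands each record to the
# handler as it arrives; Spark Streaming groups records that arrive
# within the same interval into one mini-batch.
records = [("t=%d" % t, t) for t in range(10)]  # (payload, arrival second)

# Record-at-a-time: one handler invocation per record.
processed = [payload for payload, _ in records]

# Mini-batches: bucket records by a 3-second batch interval, then hand
# each bucket to the handler as a single unit.
BATCH_INTERVAL = 3
batches = {}
for payload, t in records:
    batches.setdefault(t // BATCH_INTERVAL, []).append(payload)

print(len(processed))   # 10 handler invocations
print(sorted(batches))  # [0, 1, 2, 3] -> only 4 batch invocations
```

Batching is what buys Spark Streaming its throughput and exactly-once semantics, at the cost of the few seconds of latency noted in the table.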
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Integration
• Out of the box, Spark can interface with HBase, as it has full support for Hadoop InputFormats via newAPIHadoopRDD. Example: HBaseTest.scala from the Spark code: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala
• There are also Spark RDD implementations available for reading from and writing to HBase without needing the Hadoop API anymore, e.g. the Spark-HBase Connector: https://github.com/nerdammer/spark-hbase-connector
• SparkOnHBase is a project for HBase integration with Spark. Status: still in experimentation, with no timetable for possible support: http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/
45
3 Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark/
• Getting started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra/
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3 Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax/
• Calliope is a library providing an interface to consume data from Cassandra into Spark and store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope/
• The Cassandra storage backend with Spark is opening many new avenues.
• "Kindling: An Introduction to Spark with Cassandra (Part 1)": http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra/
47
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector: https://github.com/mongodb/mongo-hadoop
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• "MongoDB and Hadoop: Driving Business Insights": http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support, via its support for reading and writing JSON text files.
48
3 Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental). GitHub: https://github.com/spirom/spark-mongodb-connector
• "Using MongoDB with Hadoop & Spark":
  • Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
  • Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
  • Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• "Getting Started with Apache Spark and Neo4j Using Docker Compose", Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• "Categorical PageRank Using Neo4j and Apache Spark", Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• "Using Apache Spark and Neo4j for Big Data Graph Analytics", Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables.
  • Run SQL queries over imported data.
  • Easily write RDDs out to Hive tables.
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: "What's Coming in 2015 for Drill": http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• "Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game": http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
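The core idea behind Kafka's throughput, an append-only log from which each consumer reads at its own offset, can be sketched in a few lines of plain Python. This is a toy illustration of the concept, not the Kafka client API:

```python
# One partition of an append-only log. Producers append; each consumer
# tracks its own offset, so independent consumers (e.g. a Spark
# Streaming job and a batch auditor) replay the log at their own pace.
log = []

def produce(message):
    log.append(message)
    return len(log) - 1  # offset at which the message was stored

def consume(offset, max_records=10):
    """Return (records, next_offset) for a consumer positioned at offset."""
    records = log[offset:offset + max_records]
    return records, offset + len(records)

for m in ["click", "view", "click"]:
    produce(m)

fast, fast_next = consume(0)     # one consumer reads everything
slow, slow_next = consume(0, 2)  # another lags behind, harmlessly
print(fast, fast_next)  # ['click', 'view', 'click'] 3
print(slow, slow_next)  # ['click', 'view'] 2
```

Because the broker never deletes a record on delivery, a crashed Spark Streaming receiver can simply resume from its last committed offset.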
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach.
  • Approach 2 (experimental): pull-based approach using a custom sink.
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• "An introduction to JSON support in Spark SQL", February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
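What "automatically infer the schema" means can be sketched in a few lines of plain Python using the standard json module. This is a deliberate simplification: Spark SQL's real inference also merges nested structures and reconciles conflicting types across records:

```python
import json

# Scan JSON records and collect a field -> type-name mapping, the way a
# schema-inference pass discovers columns without any DDL. Note the
# second record introduces a field ("city") the first one lacks.
raw = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]

schema = {}
for line in raw:
    for field, value in json.loads(line).items():
        schema.setdefault(field, type(value).__name__)

print(schema)  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

Fields missing from some records simply become nullable columns, which is exactly how Spark SQL exposes them for querying.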
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files.
  • Run SQL queries over imported data.
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets.
    • Data layout can change without notice.
    • New data sets can be added without notice.
  • Result:
    • Leverage Spark to dynamically split the data.
    • Leverage Avro to store the data in a compact binary format.
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
  • Update and delete existing documents in Solr at scale.
• "Ingesting HDFS data into Solr using Spark": http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• "Big Data Web applications for Interactive Hadoop": https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity: Spark + Tachyon
• Tachyon is an in-memory distributed file system. By storing file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• "The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark", October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• "Spark and in-memory databases: Tachyon leading the pack", January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity: Mesos + YARN
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity: Mesos + YARN
References:
• "Apache Mesos vs. Apache Hadoop YARN": https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• "Myriad Project Marries YARN and Apache Mesos Resource Management": http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• "YARN vs. MESOS: Can't We All Just Get Along?": http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity: Spark + Tez
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity: Spark + Tez
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• "Improving Spark for Data Pipelines with Native YARN Integration": http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• "Get the most out of Spark on YARN": https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
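The "Data << RAM" point, that caching parsed data pays off as soon as you iterate over it, can be quantified with a pure-Python sketch in which a call counter stands in for the expensive parse/IO work an RDD recomputation would repeat:

```python
# Compare re-computing a parsed dataset on every pass (what an uncached
# RDD lineage does) against parsing once and iterating over the cached
# result (what rdd.cache() buys when the data fits in memory).
parse_calls = 0

def parse(line):
    global parse_calls
    parse_calls += 1
    return int(line)

raw = ["1", "2", "3"]

for _ in range(3):                      # 3 passes, re-parsing each time
    total = sum(parse(line) for line in raw)
uncached_calls = parse_calls

parse_calls = 0
cached = [parse(line) for line in raw]  # parse once, keep in memory
for _ in range(3):
    total = sum(cached)
cached_calls = parse_calls

print(uncached_calls, cached_calls)  # 9 3
```

With N iterations the uncached cost grows as N passes over the input while the cached cost stays at one, which is why iterative workloads (e.g. MLlib algorithms) favor Spark when the working set fits in cluster memory.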
4 Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• "The Challenge to Choosing the 'Right' Execution Engine", by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• "Operating in a Multi-execution Engine Hadoop Environment", by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• "New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption", February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• "Syncsort Automates Data Migrations Across Multiple Platforms", February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• "Framework for the Future of Hadoop", March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system, such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
94
Storm vs. Spark Streaming

Criteria                                 | Storm                             | Spark Streaming
Processing model                         | Record at a time                  | Mini batches
Latency                                  | Sub-second                        | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration              | Not available                     | Core Spark API
Supported languages                      | Any programming language          | Scala, Java, Python
95
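The "processing model" row can be illustrated with a toy sketch in plain Python (no Storm or Spark code; the function names are made up): record-at-a-time emits one result per record as it arrives, while mini-batching groups records and processes each group together, which is where Spark Streaming's few-seconds latency comes from.

```python
def record_at_a_time(stream, handle):
    # Storm-style: each record is handled the moment it arrives.
    return [handle(r) for r in stream]

def mini_batches(stream, batch_size, handle_batch):
    # Spark Streaming-style: records are grouped into small batches.
    # (Real Spark Streaming groups by time interval, e.g. every second.)
    out = []
    for i in range(0, len(stream), batch_size):
        out.append(handle_batch(stream[i:i + batch_size]))
    return out

stream = [1, 2, 3, 4, 5]
print(record_at_a_time(stream, lambda r: r * 10))  # one output per record
print(mini_batches(stream, 2, sum))                # one output per batch
```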
GraphX
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
96
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
97
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3. Integration
• Spark Cassandra Connector: this library lets you expose Cassandra tables as Spark RDDs, write Spark RDDs to Cassandra tables, and execute arbitrary CQL queries in your Spark applications. It also supports integration of Spark Streaming with Cassandra: https://github.com/datastax/spark-cassandra-connector
• Spark + Cassandra using Deep: the integration is not based on Cassandra's Hadoop interface: http://stratio.github.io/deep-spark
• Getting Started with Apache Spark and Cassandra: http://planetcassandra.org/getting-started-with-apache-spark-and-cassandra
• 'Cassandra' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/20-cassandra
46
3. Integration
• Benchmark of Spark & Cassandra integration using different approaches: http://www.stratio.com/deep-vs-datastax
• Calliope is a library providing an interface to consume data from Cassandra into Spark and to store Resilient Distributed Datasets (RDDs) from Spark to Cassandra: http://tuplejump.github.io/calliope
• A Cassandra storage backend with Spark is opening many new avenues.
• Kindling: An Introduction to Spark with Cassandra (Part 1): http://planetcassandra.org/blog/kindling-an-introduction-to-spark-with-cassandra
47
3. Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark Demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights: http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files: https://github.com/mongodb/mongo-hadoop
48
3. Integration
• There is also NSMC, the Native Spark MongoDB Connector, for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• An interesting blog post on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3. Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3. Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving. Open SPARK issues mentioning YARN: https://issues.apache.org/jira (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark, or embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
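What "automatically infer the schema of a JSON dataset" means can be sketched in plain Python: scan the records and collect each field name with its observed type. This is a toy version of the idea, not Spark SQL's actual implementation, which also merges conflicting types and handles nesting.

```python
import json

def infer_schema(json_lines):
    # Map each field name to the type name of the first value seen for it.
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

records = ['{"name": "ana", "age": 34}',
           '{"name": "bob", "age": 28, "city": "LA"}']
print(infer_schema(records))
```

Note that the second record contributes a field ("city") the first one lacks; Spark SQL similarly unions fields across records when inferring the schema.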
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• A great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrates ingestion of HDFS data into Solr from MapReduce to Spark
• Updates and deletes existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem | Spark ecosystem

4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, it achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big Data developers get the best of YARN's power for Hadoop-driven workloads, and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
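The rule of thumb above can be encoded as a toy heuristic (an illustration only, not a real scheduler; the threshold factor is an assumption): prefer a stream-oriented engine such as Tez when data volume far exceeds cluster RAM, and Spark's in-memory caching when the parsed data fits comfortably in memory.

```python
def pick_engine(data_gb, cluster_ram_gb, factor=2):
    # factor is an illustrative knob: how many times larger than RAM
    # the dataset must be before in-memory caching stops paying off.
    if data_gb > factor * cluster_ram_gb:
        return "tez"    # Data >> RAM: stream-oriented, mature shuffle
    return "spark"      # Data << RAM: cache parsed data in memory

print(pick_engine(data_gb=5000, cluster_ram_gb=512))
print(pick_engine(data_gb=100, cluster_ram_gb=512))
```

Real systems weigh more than raw size (data distribution, cluster load, pipeline shape), which is exactly the "smart execution engine" idea discussed next.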
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store): https://spark.apache.org/docs/latest/storage-openstack-swift.html and https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE, DataStax Enterprise, built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
86
4. Alternatives

          | Hadoop ecosystem | Spark ecosystem
Component | HDFS             | Tachyon
          | YARN             | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
3 Integration
bull Benchmark of Spark amp Cassandra Integration
using different approacheshttpwwwstratiocomdeep-vs-datastax
bull Calliope is a library providing an interface to consume
data from Cassandra to spark and store Resilient
Distributed Datasets (RDD) from Spark to Cassandrahttptuplejumpgithubiocalliope
bull Cassandra storage backend with Spark is opening many new
avenues
bull Kindling An Introduction to Spark with Cassandra
(Part 1) httpplanetcassandraorgblogkindling-an-introduction-to-
spark-with-cassandra
47
3 Integration
bull MongoDB is not directly served by Spark although
it can be used from Spark via an official Mongo-
Hadoop connector
bull MongoDB-Spark Demohttpsgithubcomcrcsmnkymongodb-spark-demo
bull MongoDB and Hadoop Driving Business Insightshttpwwwslidesharenetmongodbmongodb-and-hadoop-driving-business-
insights
bull Spark SQL also provides indirect support via its
support for reading and writing JSON text fileshttpsgithubcommongodbmongo-hadoop
48
3 Integration
bull There is also NSMC Native Spark MongoDB Connector
for reading and writing MongoDB collections directly from
Apache Spark (still experimental)
bull GitHub httpsgithubcomspiromspark-mongodb-connector
bull Using MongoDB with Hadoop amp Spark bull httpswwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-1-
introduction-setup PART 1
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-2-hive-
example Part 2
bull httpwwwmongodbcomblogpostusing-mongodb-hadoop-spark-part-3-spark-
example-key-takeaways PART 3
bull Interesting blog on Using Spark with MongoDB without
Hadoophttptugdualgrallblogspotfr201411big-data-is-hadoop-good-way-to-starthtml
49
3 Integration
bull Neo4j is a highly scalable robust (fully ACID) native graph
database
bull Getting Started with Apache Spark and Neo4j Using
Docker Compose By Kenny Bastani March 10 2015
httpwwwkennybastanicom201503spark-neo4j-tutorial-dockerhtml
bull Categorical PageRank Using Neo4j and Apache SparkBy Kenny Bastani January 19 2015
httpwwwkennybastanicom201501categorical-pagerank-neo4j-sparkhtml
bull Using Apache Spark and Neo4j for Big Data Graph
Analytics By Kenny Bastani November 3 2014
httpwwwkennybastanicom201411using-apache-spark-and-neo4j-for-bightml
50
3 Integration YARN
bull YARN Yet Another Resource Negotiator Implicit
reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND
20summary20~20yarn20AND20status203D20OPEN20ORDER20
BY20priority20DESC0A
bull Some issues are critical ones
bull Running Spark on YARNhttpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
51
3 Integration
bull Spark SQL provides built in support for Hivetables
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 013 is supported in Spark 120
bull Support of ORCFile (Optimized Row Columnarfile) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries andfor fetching dataset machine learning algorithmsin MLlib
52
3 Integration
bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to
address new use cases
bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query
in-memory data in Spark Embed Drill execution in a
Spark data pipeline
Source Whats Coming in 2015 for
Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
53
3 Integration
bull Apache Kafka is a high throughput distributed
messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka
Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming
Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-
example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka
54
3 Integration
bull Apache Flume is a streaming event data
ingestion system that is designed for Big Data
ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with
Flume There are two approaches to this
bull Approach 1 Flume-style Push-based Approach
bull Approach 2 (Experimental) Pull-based
Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015 http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
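The schema-inference idea above can be sketched in a few lines of plain Python. This is an illustrative toy, not Spark SQL's actual implementation: walk the JSON records and record the widest type seen for each field.

```python
import json

def infer_schema(json_lines):
    """Toy schema inference: map each field to the name of the Python
    type of its values, widening int to float when both occur and
    falling back to string on any other conflict."""
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            seen = schema.get(field)
            current = type(value).__name__
            if seen is None or seen == current:
                schema[field] = current
            elif {seen, current} == {"int", "float"}:
                schema[field] = "float"   # widen numeric types
            else:
                schema[field] = "string"  # incompatible types: fall back
    return schema

lines = ['{"name": "Ada", "age": 36}',
         '{"name": "Alan", "age": 41.5}']
print(infer_schema(lines))  # {'name': 'str', 'age': 'float'}
```

Spark SQL does this over distributed data and maps to SQL types, but the principle (one pass over the records, per-field type widening) is the same.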
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
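Why a columnar format like Parquet helps analytical queries can be shown with a toy pure-Python pivot (illustrative only, nothing like the actual Parquet encoding): once rows are pivoted into columns, an aggregation scans one contiguous column instead of every field of every row.

```python
def to_columnar(rows):
    """Pivot row-oriented records into a column store:
    one list per field, values kept in row order."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns

rows = [{"user": "a", "bytes": 120},
        {"user": "b", "bytes": 300},
        {"user": "a", "bytes": 80}]
cols = to_columnar(rows)
# The aggregation touches only the "bytes" column; the "user"
# column is never read, which is the core columnar win.
print(sum(cols["bytes"]))  # 500
```

Real columnar formats add per-column compression and encodings on top of this layout, which is why they pair so well with Spark SQL scans.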
3 Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+. https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1. http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark. https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014) http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015) http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References:
• Apache Mesos vs. Apache Hadoop YARN https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
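The Data << RAM point rests on caching parsed data so that repeated passes skip re-parsing. A toy pure-Python analogue of RDD.cache() (illustrative only, not Spark code):

```python
import json

parse_count = 0

def parse(raw_lines):
    """Parse a batch of JSON lines, counting how many full parse
    passes actually happen."""
    global parse_count
    parse_count += 1
    return [json.loads(line) for line in raw_lines]

class CachedDataset:
    """Toy analogue of RDD.cache(): parse once on first access,
    then serve every subsequent pass from memory."""
    def __init__(self, raw_lines):
        self.raw_lines = raw_lines
        self._cached = None

    def records(self):
        if self._cached is None:
            self._cached = parse(self.raw_lines)
        return self._cached

raw = ['{"v": 1}', '{"v": 2}', '{"v": 3}']
ds = CachedDataset(raw)
total = sum(r["v"] for r in ds.records())    # first pass: parses
maximum = max(r["v"] for r in ds.records())  # second pass: cache hit
print(total, maximum, parse_count)  # 6 3 1
```

With data much larger than memory the cache no longer fits, which is exactly why the Data >> RAM case above can favor a more stream-oriented engine.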
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014 http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015 http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015 http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015 http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: a healthy dose of Hadoop ecosystem integration with Spark; more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon. http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3 http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store)
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012 https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015 http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support) http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2 http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI) http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH) http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015 https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014 https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014 http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014 http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Hadoop ecosystem → Spark ecosystem
Components:
HDFS → Tachyon
YARN → Mesos
Tools:
Pig → Spark native API
Hive → Spark SQL
Mahout → MLlib
Storm → Spark Streaming
Giraph → GraphX
HUE → Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria → YARN / Mesos
Resource sharing: Yes / Yes
Written in: Java / C++
Scheduling: Memory only / CPU and memory
Running tasks: Unix processes / Linux container groups
Requests: Specific requests and locality preference / More generic, but more coding for writing frameworks
Maturity: Less mature / Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark, First Spark London Meetup, May 28, 2014 http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/11-core-spark
91
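The canonical example of the native API is word count via flatMap, map, and reduceByKey. A pure-Python mimic of the same dataflow (illustrative only, not the pyspark API; `flat_map` is a helper defined here, not a Spark function):

```python
from collections import Counter
from itertools import chain

def flat_map(f, data):
    """Mimic RDD.flatMap: apply f to each element and flatten."""
    return list(chain.from_iterable(f(x) for x in data))

lines = ["spark or hadoop", "spark with hadoop"]
words = flat_map(str.split, lines)   # like lines.flatMap(_.split(" "))
pairs = [(w, 1) for w in words]      # like words.map(w => (w, 1))
counts = Counter()                   # like reduceByKey(_ + _)
for word, n in pairs:
    counts[word] += n
print(dict(counts))  # {'spark': 2, 'or': 1, 'hadoop': 2, 'with': 1}
```

In Spark the same three steps run partitioned across the cluster; the lambda-friendly shape of the pipeline is what the Java 8 bullet above is getting at.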
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria → Storm / Spark Streaming
Processing model: Record at a time / Mini-batches
Latency: Sub-second / Few seconds
Fault tolerance (every record processed): At least once (may be duplicates) / Exactly once
Batch framework integration: Not available / Core Spark API
Supported languages: Any programming language / Scala, Java, Python
95
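A practical consequence of the table above: with at-least-once semantics, replays can deliver duplicates, so a common remedy is idempotent processing keyed by a stable record id. A framework-independent toy sketch (not Storm or Spark code; the event ids and amounts are made up for illustration):

```python
def process_at_least_once(events, state, seen_ids):
    """Turn at-least-once delivery into effectively-exactly-once
    results by de-duplicating on a stable event id before
    applying each update."""
    for event_id, amount in events:
        if event_id in seen_ids:
            continue          # replayed duplicate: skip it
        seen_ids.add(event_id)
        state["total"] += amount
    return state

state, seen = {"total": 0}, set()
# Event 2 arrives twice, as at-least-once semantics allows.
stream = [(1, 10), (2, 5), (2, 5), (3, 7)]
print(process_at_least_once(stream, state, seen))  # {'total': 22}
```

Without the `seen_ids` check the duplicate would inflate the total to 27; exactly-once systems like Spark Streaming do this bookkeeping for you.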
GraphX
96
'GraphX' tag at SparkBigData.com http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython. https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
3 Integration
• MongoDB is not directly served by Spark, although it can be used from Spark via the official Mongo-Hadoop connector.
• MongoDB-Spark demo: https://github.com/crcsmnky/mongodb-spark-demo
• MongoDB and Hadoop: Driving Business Insights http://www.slideshare.net/mongodb/mongodb-and-hadoop-driving-business-insights
• Spark SQL also provides indirect support via its support for reading and writing JSON text files. https://github.com/mongodb/mongo-hadoop
48
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015 http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015 http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014 http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator).
• Integration is still improving; some issues are critical ones. Open Spark-on-YARN issues in JIRA: https://issues.apache.org/jira/ (JQL: project = SPARK AND summary ~ yarn AND status = OPEN ORDER BY priority DESC)
• Running Spark on YARN http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883). https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to
address new use cases
bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query
in-memory data in Spark Embed Drill execution in a
Spark data pipeline
Source Whats Coming in 2015 for
Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
53
3 Integration
bull Apache Kafka is a high throughput distributed
messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka
Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming
Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-
example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka
54
3 Integration
bull Apache Flume is a streaming event data
ingestion system that is designed for Big Data
ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with
Flume There are two approaches to this
bull Approach 1 Flume-style Push-based Approach
bull Approach 2 (Experimental) Pull-based
Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
55
3 Integration
bull Spark SQL provides built in support for JSON that
is vastly simplifying the end-to-end-experience of
working with JSON data
bull Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD No more DDL Just point Spark
SQL to JSON files and query Starting Spark 13
SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-
support-in-spark-sqlhtml
56
3 Integration
bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows to
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
57
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
bull Problem
bull Various inbound data sets
bull Data Layout can change without notice
bull New data sets can be added without notice
Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
bull The Kite SDK provides high level abstractions to
work with datasets on Hadoop hiding many of
the details of compression codecs file formats
partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016
release so Spark jobs can read and write to Kite
datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
59
3 Integration
bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark in the form of RDD that can
read data from Elasticsearch Also any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html
60
3 Integration
bull Apache Solr added a Spark-based indexing tool for
fast and easy indexing ingestion and serving
searchable complex data ldquoCrunchIndexerTool on
Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN
cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache
Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-
resource-management
bull YARN vs MESOS Canrsquot We All Just Get
Along httpstrataconfcombig-data-conference-ca-
2015publicscheduledetail40620
67
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the
need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with
YARN (resource chaining in clusters)
bull Tez supports enterprise security
68
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
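In practice, Spark's storage agnosticism shows up as nothing more than the URI scheme you hand to calls like `sc.textFile(...)`. The plain-Python sketch below illustrates that dispatch idea; the scheme names are real, but the routing table itself is a hypothetical illustration, not Spark internals.

```python
from urllib.parse import urlparse

# Illustrative routing of storage URIs to backends, mimicking how Spark
# selects a file-system implementation from the URI scheme.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "tachyon": "Tachyon in-memory file system",
    "swift": "OpenStack Swift",
    "file": "local file system",
}

def storage_backend(uri):
    scheme = urlparse(uri).scheme or "file"  # bare paths default to local
    return BACKENDS.get(scheme, "unknown")

assert storage_backend("hdfs://namenode:8020/data/input") == "Hadoop Distributed File System"
assert storage_backend("s3n://my-bucket/logs/") == "Amazon S3"
assert storage_backend("/tmp/input.txt") == "local file system"
```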
1 File System
When coupled with its analytics capabilities, file-system agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
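Each deployment mode above is selected through the `--master` URL passed to `spark-submit` (e.g. `local[4]`, `spark://host:7077`, `mesos://host:5050`, `yarn-cluster`). A small sketch of classifying such URLs; the classifier function is an illustrative helper of mine, not part of Spark, though the URL formats themselves are the documented ones.

```python
# Classify Spark --master URLs by cluster manager; illustrative helper.
def cluster_manager(master):
    if master == "yarn" or master.startswith("yarn-"):
        return "YARN"
    if master.startswith("mesos://"):
        return "Mesos"
    if master.startswith("spark://"):
        return "Standalone"
    if master.startswith("local"):
        return "Local"
    return "unknown"

assert cluster_manager("local[4]") == "Local"
assert cluster_manager("spark://master:7077") == "Standalone"
assert cluster_manager("mesos://zk://host:2181/mesos") == "Mesos"
assert cluster_manager("yarn-cluster") == "YARN"
```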
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
            Hadoop Ecosystem   Spark Ecosystem
Component   HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps. Provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos
Criteria           YARN                      Mesos
Resource sharing   Yes                       Yes
Written in         Java                      C++
Scheduling         Memory only               CPU and Memory
Running tasks      Unix processes            Linux Container groups
Requests           Specific requests and     More generic, but more coding
                   locality preference       for writing frameworks
Maturity           Less mature               Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
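To give a feel for the API shape these bullets describe, here is the classic word count expressed with lambdas over plain Python lists, a stand-in for RDDs so the snippet runs anywhere. In PySpark, the equivalent chain would be something like `sc.textFile(...).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(add)`; the version below only models that shape.

```python
from collections import Counter
from itertools import chain

lines = ["to be or not to be", "to spark or to hadoop"]

# flatMap: split each line into words (lazily, like an RDD transformation)
words = chain.from_iterable(map(lambda line: line.split(), lines))

# map + reduceByKey: pair each word with 1 and sum the counts;
# Counter does both steps at once here.
counts = Counter(words)

assert counts["to"] == 4
assert counts["spark"] == 1
```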
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming
Criteria                  Storm                  Spark Streaming
Processing model          Record at a time       Mini batches
Latency                   Sub-second             Few seconds
Fault tolerance (every    At least once (may     Exactly once
record processed)         be duplicates)
Batch framework           Not available          Core Spark API
integration
Supported languages       Any programming        Scala, Java, Python
                          language
95
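The "record at a time" vs "mini batches" distinction in the table can be sketched in plain Python: Spark Streaming chops an incoming stream into small time-indexed batches and processes each batch with the regular Spark API. This is a toy model with fixed timestamps, not Spark code.

```python
from collections import defaultdict

# Toy mini-batching: group (timestamp, record) pairs into fixed-interval
# batches, the way Spark Streaming discretizes a stream into RDDs.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]

def mini_batches(events, interval=1.0):
    batches = defaultdict(list)
    for t, record in events:
        batches[int(t // interval)].append(record)
    return [batches[k] for k in sorted(batches)]

assert mini_batches(events) == [["a", "b"], ["c", "d"], ["e"]]
```

A record-at-a-time engine like Storm would instead hand each of the five records to the processing code individually, which is what buys its sub-second latency.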
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Integration
• There is also NSMC, a Native Spark MongoDB Connector for reading and writing MongoDB collections directly from Apache Spark (still experimental).
• GitHub: https://github.com/spirom/spark-mongodb-connector
• Using MongoDB with Hadoop & Spark:
• Part 1: https://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-1-introduction-setup
• Part 2: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-2-hive-example
• Part 3: http://www.mongodb.com/blog/post/using-mongodb-hadoop-spark-part-3-spark-example-key-takeaways
• Interesting blog on using Spark with MongoDB without Hadoop: http://tugdualgrall.blogspot.fr/2014/11/big-data-is-hadoop-good-way-to-start.html
49
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN: Yet Another Resource Negotiator (an implicit reference to Mesos as the resource negotiator)
• Integration still improving: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20summary%20~%20yarn%20AND%20status%20%3D%20OPEN%20ORDER%20BY%20priority%20DESC
• Some issues are critical ones.
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
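As a rough illustration of what "automatically infer the schema" means, the toy function below walks JSON records and derives a field-to-type mapping. This is a deliberate simplification of what Spark SQL does internally when loading JSON (it is not Spark code, and real inference also merges conflicting types and handles nesting).

```python
import json

# Toy JSON schema inference: union the fields seen across records and
# record each field's type; a simplified model of Spark SQL's automatic
# schema inference for JSON datasets.
records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]

def infer_schema(json_lines):
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, type(value).__name__)
    return schema

assert infer_schema(records) == {"name": "str", "age": "int", "city": "str"}
```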
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of the integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data, integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or... HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, which make code nearly as concise as the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
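The conciseness of the native API can be illustrated with the classic word count, which is a handful of chained transformations in Scala (the input path is illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("wc").setMaster("local[2]"))

// Read, tokenize, pair and aggregate in one pipeline.
val counts = sc.textFile("hdfs://namenode/data/input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)
sc.stop()
```

The equivalent Java 8 version reads almost identically once lambdas replace anonymous inner classes.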
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
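A sketch of mixing declarative SQL with the programmatic API on the same data, using the Spark 1.2-era SchemaRDD API (the case class and table name are illustrative; `sc` is assumed to be an existing SparkContext, e.g. from spark-shell):

```scala
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD -> SchemaRDD conversion (Spark 1.2)

val people = sc.parallelize(Seq(Person("Alice", 34), Person("Bob", 17)))
people.registerTempTable("people")

// Declarative SQL and RDD-style transformations compose freely.
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
adults.map(row => row.getString(0)).collect().foreach(println)
```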
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming
Criteria              Storm                 Spark Streaming
Processing model      Record at a time      Mini batches
Latency               Sub-second            Few seconds
Fault tolerance       At least once         Exactly once
(every record         (may be duplicates)
processed)
Batch framework       Not available         Core Spark API
integration
Supported languages   Any programming       Scala, Java, Python
                      language
95
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markdown or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Integration
• Neo4j is a highly scalable, robust (fully ACID) native graph database.
• Getting Started with Apache Spark and Neo4j Using Docker Compose, by Kenny Bastani, March 10, 2015: http://www.kennybastani.com/2015/03/spark-neo4j-tutorial-docker.html
• Categorical PageRank Using Neo4j and Apache Spark, by Kenny Bastani, January 19, 2015: http://www.kennybastani.com/2015/01/categorical-pagerank-neo4j-spark.html
• Using Apache Spark and Neo4j for Big Data Graph Analytics, by Kenny Bastani, November 3, 2014: http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html
50
3 Integration: YARN
• YARN = Yet Another Resource Negotiator, an implicit reference to Mesos as "the" resource negotiator.
• Integration is still improving, and some open SPARK JIRA issues about YARN are critical ones: https://issues.apache.org/jira (search open SPARK issues with "yarn" in the summary)
• Running Spark on YARN: http://spark.apache.org/docs/latest/running-on-yarn.html
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
51
3 Integration
• Spark SQL provides built-in support for Hive tables:
  • Import relational data from Hive tables
  • Run SQL queries over imported data
  • Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support for the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0 (SPARK-2883): https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
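A sketch of querying a Hive table from Spark via HiveContext (assumes `sc` is an existing SparkContext, a Hive metastore is reachable, and the table and query are illustrative):

```scala
import org.apache.spark.sql.hive.HiveContext

// HiveContext speaks HiveQL and uses the Hive metastore and data formats.
val hiveContext = new HiveContext(sc)

hiveContext.sql("CREATE TABLE IF NOT EXISTS logs (level STRING, msg STRING)")
val errors = hiveContext.sql("SELECT msg FROM logs WHERE level = 'ERROR'")
errors.collect().foreach(println)
```

The result is an ordinary SchemaRDD, so it can feed straight into MLlib or other RDD transformations, which is the "fetching datasets for machine learning" point above.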
3 Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
  • Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
  • Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3 Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
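A sketch of the receiver-based Kafka integration from the Spark 1.2-era API (the ZooKeeper quorum, consumer group and topic name are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-sketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(2))

// createStream(ssc, zkQuorum, consumerGroup, topics-with-thread-counts)
val messages = KafkaUtils.createStream(
  ssc, "zk1:2181", "demo-group", Map("events" -> 1))

messages.map(_._2)                 // drop the Kafka key, keep the payload
        .flatMap(_.split("\\s+"))
        .map(w => (w, 1))
        .reduceByKey(_ + _)
        .print()

ssc.start()
ssc.awaitTermination()
```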
3 Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
  • Approach 1: Flume-style push-based approach
  • Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3 Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
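The schema-inference point can be sketched in a few lines (the file path is illustrative; `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// The schema is inferred from the JSON records themselves -- no DDL needed.
val people = sqlContext.jsonFile("hdfs://namenode/data/people.json")
people.printSchema()

people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 18")
          .collect().foreach(println)
```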
3 Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
  • Import relational data from Parquet files
  • Run SQL queries over imported data
  • Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
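A round-trip sketch of the Parquet support in the Spark 1.2-era API (paths are illustrative; `sc` is assumed to be an existing SparkContext):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Write: any SchemaRDD can be persisted as Parquet, schema included.
val people = sqlContext.jsonFile("hdfs://namenode/data/people.json")
people.saveAsParquetFile("hdfs://namenode/data/people.parquet")

// Read back: the schema travels with the file, so it is queryable at once.
val parquetPeople = sqlContext.parquetFile("hdfs://namenode/data/people.parquet")
parquetPeople.registerTempTable("people_pq")
sqlContext.sql("SELECT COUNT(*) FROM people_pq").collect().foreach(println)
```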
3 Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
  • Problem:
    • Various inbound data sets
    • Data layout can change without notice
    • New data sets can be added without notice
  • Result:
    • Leverage Spark to dynamically split the data
    • Leverage Avro to store the data in a compact binary format
58
3 Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
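A sketch of the elasticsearch-hadoop RDD integration (the cluster address and index name are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._  // adds saveToEs to RDDs

val conf = new SparkConf()
  .setAppName("es-sketch").setMaster("local[2]")
  .set("es.nodes", "localhost:9200")  // where the Elasticsearch cluster lives
val sc = new SparkContext(conf)

// Any RDD whose elements can be turned into documents can be indexed.
val docs = sc.parallelize(Seq(
  Map("user" -> "alice", "msg" -> "hello"),
  Map("user" -> "bob",   "msg" -> "world")))
docs.saveToEs("demo/messages")  // index/type

sc.stop()
```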
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark
  • Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity
• Data >> RAM: for processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4 Complementarity
• Emergence of the "Smart Execution Engine" layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer): http://www.infoq.com/articles/datameer-smart-execution-engine
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS (Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
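The Amazon S3 option above can be sketched as follows, using the Hadoop s3n:// connector (the bucket name and credentials are placeholders; the connector jar is assumed to be on the classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("s3-sketch").setMaster("local[2]"))

// AWS credentials for the s3n:// connector; values below are placeholders.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// No HDFS involved: Spark reads straight from the object store.
val logs = sc.textFile("s3n://my-bucket/logs/2015/03/*.log")
println(logs.filter(_.contains("ERROR")).count())
sc.stop()
```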
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying infrastructure for clustering, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
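In practice the deployment choice largely reduces to the master URL handed to SparkConf; the application code itself stays unchanged. A sketch (all host names and URLs below are illustrative placeholders):

```scala
import org.apache.spark.SparkConf

// Same application, different clustering back-end: only the master differs.
val local      = new SparkConf().setAppName("app").setMaster("local[4]")           // local, 4 threads
val standalone = new SparkConf().setAppName("app").setMaster("spark://host:7077")  // standalone cluster
val mesos      = new SparkConf().setAppName("app").setMaster("mesos://host:5050")  // Mesos

// On YARN, EC2, EMR etc. the master is usually supplied externally, e.g.:
//   spark-submit --master yarn-cluster --class Main app.jar
```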
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
3 Integration YARN
bull YARN Yet Another Resource Negotiator Implicit
reference to Mesos as the Resource Negotiator
bull Integration still improving httpsissuesapacheorgjiraissuesjql=project203D20SPARK20AND
20summary20~20yarn20AND20status203D20OPEN20ORDER20
BY20priority20DESC0A
bull Some issues are critical ones
bull Running Spark on YARNhttpsparkapacheorgdocslatestrunning-on-yarnhtml
bull Get the most out of Spark on YARNhttpswwwyoutubecomwatchv=Vkx-TiQ_KDU
51
3 Integration
bull Spark SQL provides built in support for Hivetables
bull Import relational data from Hive tables
bull Run SQL queries over imported data
bull Easily write RDDs out to Hive tables
bull Hive 013 is supported in Spark 120
bull Support of ORCFile (Optimized Row Columnarfile) format is targeted in Spark 130 Spark-2883httpsissuesapacheorgjirabrowseSPARK-2883
bull Hive can be used both for analytical queries andfor fetching dataset machine learning algorithmsin MLlib
52
3 Integration
bull Drill is intended to achieve the sub-second latency
needed for interactive data analysis and exploration httpdrillapacheorg
bull Drill and Spark Integration is work in progress in 2015 to
address new use cases
bull Use a Drill query (or view) as the input to Spark Drill
extracts and pre-processes data from various data
sources and turns it into input to Spark
bull Use Drill to query Spark RDDs Use BI tools to query
in-memory data in Spark Embed Drill execution in a
Spark data pipeline
Source Whats Coming in 2015 for
Drillhttpdrillapacheorgblog20141216whats-coming-in-2015
53
3 Integration
bull Apache Kafka is a high throughput distributed
messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka
Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming
Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-
example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka
54
3 Integration
bull Apache Flume is a streaming event data
ingestion system that is designed for Big Data
ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with
Flume There are two approaches to this
bull Approach 1 Flume-style Push-based Approach
bull Approach 2 (Experimental) Pull-based
Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
55
3 Integration
bull Spark SQL provides built in support for JSON that
is vastly simplifying the end-to-end-experience of
working with JSON data
bull Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD No more DDL Just point Spark
SQL to JSON files and query Starting Spark 13
SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-
support-in-spark-sqlhtml
56
3 Integration
bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows to
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
57
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
bull Problem
bull Various inbound data sets
bull Data Layout can change without notice
bull New data sets can be added without notice
Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
bull The Kite SDK provides high level abstractions to
work with datasets on Hadoop hiding many of
the details of compression codecs file formats
partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016
release so Spark jobs can read and write to Kite
datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
59
3 Integration
bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark in the form of RDD that can
read data from Elasticsearch Also any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html
60
3 Integration
bull Apache Solr added a Spark-based indexing tool for
fast and easy indexing ingestion and serving
searchable complex data ldquoCrunchIndexerTool on
Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem | Spark ecosystem
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
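The Data << RAM point can be made concrete with a toy cache: parse once, keep the parsed records in memory, and every later pass over the data skips the parsing cost, which is the effect Spark's `rdd.cache()` gives. This sketch is plain Python, not Spark:

```python
import json

class ParsedCache:
    """Toy illustration of the Data << RAM case: when parsed records fit
    in memory, caching them means only the first pass pays the parse cost,
    analogous to caching an RDD of parsed data in Spark."""

    def __init__(self, raw_lines):
        self.raw = raw_lines
        self._parsed = None

    def records(self):
        if self._parsed is None:        # parse once, on first access
            self._parsed = [json.loads(line) for line in self.raw]
        return self._parsed            # later passes reuse the same list
```

When the data is much bigger than memory, this strategy stops paying off, which is exactly the Data >> RAM case described above.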
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, on January 27th, 2015 at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is already a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
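Spark's storage-agnosticism boils down to resolving a storage backend from the path's URI scheme. A hypothetical mini-registry (the loader names are invented for illustration, not Spark internals) sketches the idea:

```python
from urllib.parse import urlparse

# Hypothetical loader registry: a file-system-agnostic engine dispatches
# on the URI scheme of the input path, so s3://, hdfs:// and plain local
# paths can all feed the same processing code.
LOADERS = {
    "s3": lambda p: f"loading {p} from Amazon S3",
    "hdfs": lambda p: f"loading {p} from HDFS",
    "file": lambda p: f"loading {p} from the local file system",
}

def open_path(path):
    scheme = urlparse(path).scheme or "file"   # bare paths default to local
    if scheme not in LOADERS:
        raise ValueError(f"no loader registered for scheme {scheme!r}")
    return LOADERS[scheme](path)
```

Swapping HDFS for S3, Tachyon, or CassandraFS then amounts to registering another scheme, which is why the alternatives listed above are drop-in choices.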
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
86
4. Alternatives
                Hadoop ecosystem    Spark ecosystem
Components:
                HDFS                Tachyon
                YARN                Mesos
Tools:
                Pig                 Spark native API
                Hive                Spark SQL
                Mahout              MLlib
                Storm               Spark Streaming
                Giraph              GraphX
                HUE                 Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services, including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria          YARN                         Mesos
Resource sharing  Yes                          Yes
Written in        Java                         C++
Scheduling        Memory only                  CPU and memory
Running tasks     Unix processes               Linux container groups
Requests          Specific requests and        More generic, but more coding
                  locality preference          for writing frameworks
Maturity          Less mature                  Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, so code can be nearly as concise as with the Scala API.
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
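The conciseness point is easy to see in any language with lambdas and comprehensions. Below, a word count written as a flatMap/map/reduce chain in plain Python (no Spark required) mirrors the shape of the RDD API:

```python
from functools import reduce

lines = ["to be or not to be", "to do is to be"]

# flatMap: one line -> many words; map: word -> (word, 1)
pairs = ((word, 1) for line in lines for word in line.split())

# reduceByKey-style merge into a dict of counts
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

counts = reduce(merge, pairs, {})
```

In Spark itself the same pipeline would be `flatMap`, `map`, and `reduceByKey` on an RDD; with Java 8 lambdas it reads almost identically to the Scala version, which is the slide's point.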
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                 Storm                      Spark Streaming
Processing model         Record at a time           Mini-batches
Latency                  Sub-second                 Few seconds
Fault tolerance (every   At least once (may be      Exactly once
record processed)        duplicates)
Batch framework          Not available              Core Spark API
integration
Supported languages      Any programming language   Scala, Java, Python
95
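The "mini-batches" row is the key difference in the table above. A stdlib sketch of grouping a record stream into fixed-size micro-batches, roughly what Spark Streaming does per batch interval, whereas Storm would hand each record to the topology as it arrives:

```python
def micro_batches(stream, batch_size):
    """Group a record stream into mini-batches, as Spark Streaming does.

    A record waits until its batch is full (or the stream ends) before it
    is processed -- that buffering is the source of the 'few seconds'
    latency in the comparison, versus record-at-a-time processing."""
    batch = []
    for rec in stream:
        batch.append(rec)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                 # flush the final partial batch
        yield batch
```

In exchange for the added latency, batching gives Spark Streaming its exactly-once semantics and its reuse of the core batch API, per the other rows of the table.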
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
3. Integration
• Spark SQL provides built-in support for Hive tables:
• Import relational data from Hive tables
• Run SQL queries over imported data
• Easily write RDDs out to Hive tables
• Hive 0.13 is supported in Spark 1.2.0.
• Support of the ORCFile (Optimized Row Columnar file) format is targeted for Spark 1.3.0, SPARK-2883: https://issues.apache.org/jira/browse/SPARK-2883
• Hive can be used both for analytical queries and for fetching datasets for machine learning algorithms in MLlib.
52
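The "import relational data, run SQL, read results back" workflow above can be sketched without a Hive metastore or a cluster. This stand-in swaps Spark SQL for Python's stdlib sqlite3 purely to show the pattern (the table and column names are invented for illustration, not the Spark API):

```python
import sqlite3

def total_hits(rows):
    """Import relational rows, run a SQL query over them, and return the
    result as plain data -- the same three-step shape as registering a
    Hive table in Spark SQL and querying it, with sqlite3 as the engine."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (url TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
    (total,) = conn.execute("SELECT SUM(hits) FROM page_views").fetchone()
    conn.close()
    return total
```

In Spark SQL the equivalent would register the dataset as a table and hand the query result back as an RDD, ready for further processing or for MLlib.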
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015, to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark; embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015/
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/
• 'Kafka' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
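The abstraction the Spark Streaming integration builds on is Kafka's append-only log: producers append, and each consumer reads from an offset it tracks itself, so a failed consumer can replay from its last saved offset. A toy in-memory version of that model (illustrative only; this is not Kafka's actual API):

```python
class ToyLog:
    """In-memory append-only log sketching Kafka's core model: messages
    get sequential offsets, and consumers re-read from any saved offset,
    which is what makes replay-based fault tolerance possible."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1      # offset of the new message

    def read_from(self, offset):
        return self.messages[offset:]      # everything at or after offset
```

Checkpointing the consumer's offset, rather than acknowledging individual messages, is the design choice that lets a streaming job resume deterministically after a failure.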
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that is vastly simplifying the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL; just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
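Schema inference is the heart of the JSON support described above. A toy flat-field version in plain Python shows the idea; Spark SQL's real inference additionally handles nested structures and reconciles conflicting types across records:

```python
def infer_schema(records):
    """Walk a list of JSON-like objects and infer a field -> type-name
    map, a toy version of what Spark SQL does when loading a JSON
    dataset without any DDL. The first type seen for a field wins."""
    schema = {}
    for rec in records:
        for field, value in rec.items():
            schema.setdefault(field, type(value).__name__)
    return schema
```

Because the schema is derived from the data itself, new fields appearing in later files simply widen the inferred schema instead of requiring a table definition change.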
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• This is an illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
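Why columnar? Pivoting rows into per-field arrays means a query that touches one column reads only that array from disk. A minimal sketch of the layout idea (not the Parquet format itself, which adds encoding, compression, and row groups on top):

```python
def to_columnar(rows):
    """Pivot row-oriented records into column arrays -- the core idea
    behind columnar formats like Parquet: a query over one field scans
    one contiguous array instead of every full row."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns
```

Storing each column contiguously is also what makes per-column compression so effective, since values of one type and domain sit next to each other.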
3. Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• This is an example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
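The "dynamically split the data" step in the use case above can be sketched in plain Python: bucket each inbound record by its field layout, so a changed or brand-new layout lands in its own bucket instead of breaking the pipeline (Avro would then store each bucket compactly against its own schema). This is an illustrative sketch, not the solution from the linked talk:

```python
def split_by_layout(records):
    """Bucket inbound records by their set of field names, so each
    distinct layout can be handled (and serialized) separately --
    a toy version of dynamically splitting heterogeneous input data."""
    buckets = {}
    for rec in records:
        layout = tuple(sorted(rec))        # the record's field names
        buckets.setdefault(layout, []).append(rec)
    return buckets
```

In a Spark job this grouping would typically be a keyBy/groupBy on the layout signature, with each group written out under its matching Avro schema.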
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
3. Integration
• Drill is intended to achieve the sub-second latency needed for interactive data analysis and exploration: http://drill.apache.org
• Drill and Spark integration is work in progress in 2015 to address new use cases:
• Use a Drill query (or view) as the input to Spark: Drill extracts and pre-processes data from various data sources and turns it into input to Spark.
• Use Drill to query Spark RDDs: use BI tools to query in-memory data in Spark, or embed Drill execution in a Spark data pipeline.
Source: What's Coming in 2015 for Drill: http://drill.apache.org/blog/2014/12/16/whats-coming-in-2015
53
3. Integration
• Apache Kafka is a high-throughput distributed messaging system: http://kafka.apache.org
• Spark Streaming integrates natively with Kafka. Spark Streaming + Kafka Integration Guide: http://spark.apache.org/docs/latest/streaming-kafka-integration.html
• Tutorial: Integrating Kafka and Spark Streaming: Code Examples and State of the Game: http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial
• 'Kafka' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/24-kafka
54
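Kafka's core abstraction is a partitioned, append-only log that consumers read by offset, at their own pace. A toy plain-Python sketch of that idea (the class and method names are made up for illustration; this is not Kafka's actual API):

```python
class Log:
    """Toy append-only log: producers append, consumers track their own offsets."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset assigned to the new record

    def read(self, offset, max_records=10):
        """Return up to max_records starting at offset; the consumer advances itself."""
        return self.records[offset:offset + max_records]

log = Log()
for msg in ["click", "view", "click"]:
    log.append(msg)

# Two independent consumers can read from different offsets without coordination.
assert log.read(0, 2) == ["click", "view"]
assert log.read(2) == ["click"]
```

Because consumers own their offsets, a Spark Streaming job can replay a range of the log after a failure, which is what makes the native integration above practical.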
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem: http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
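Conceptually, "inferring the schema of a JSON dataset" means scanning records and unioning the fields and value types seen. A plain-Python sketch of the idea (a toy illustration, not Spark SQL's implementation):

```python
import json

def infer_schema(lines):
    """Union field -> set of type names over all JSON records: a miniature
    version of what Spark SQL does before exposing the data as a SchemaRDD."""
    schema = {}
    for line in lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return {field: sorted(types) for field, types in schema.items()}

data = ['{"name": "ada", "age": 36}', '{"name": "bob", "city": "LA"}']
assert infer_schema(data) == {"name": ["str"], "age": ["int"], "city": ["str"]}
```

Records may have different fields; the inferred schema is the union, with missing fields treated as nullable, which is why no up-front DDL is needed.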
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language: http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
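The point of a columnar format like Parquet is that a query touching one column reads only that column's data. The layout idea in miniature, assuming simple row dicts (a sketch of the storage concept, not the Parquet format itself):

```python
def to_columnar(rows):
    """Pivot row-oriented records into column vectors: the core layout idea
    behind columnar formats such as Parquet."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

rows = [{"user": "ada", "bytes": 10}, {"user": "bob", "bytes": 32}]
cols = to_columnar(rows)

# A scan of just the "bytes" column never touches the "user" strings.
assert sum(cols["bytes"]) == 42
```

Columnar layout also compresses well, since each vector holds values of a single type.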
3. Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL (requires Spark 1.2+): https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support has been added to the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem | Spark ecosystem
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented," has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop-ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
73
1. File System
Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
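In practice, Spark's storage agnosticism comes down to resolving the input path's URI scheme to a storage backend. The dispatch idea in miniature (the scheme-to-backend table here is illustrative only; Spark actually resolves schemes through Hadoop's FileSystem API):

```python
from urllib.parse import urlparse

# Illustrative scheme -> backend table, not Spark's actual registry.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "cfs": "Cassandra File System",
    "tachyon": "Tachyon in-memory file system",
    "file": "local file system",
}

def backend_for(path):
    """Pick a storage backend from the URI scheme, defaulting to local files."""
    scheme = urlparse(path).scheme or "file"
    return BACKENDS[scheme]

assert backend_for("s3n://bucket/logs/") == "Amazon S3"
assert backend_for("/tmp/data.txt") == "local file system"
```

The same job logic runs unchanged whether the path says hdfs://, s3n://, or tachyon://, which is what "Bring Your Own Storage" means here.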
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
86
4. Alternatives
Hadoop Ecosystem | Spark Ecosystem
Components:
HDFS | Tachyon
YARN | Mesos
Tools:
Pig | Spark native API
Hive | Spark SQL
Mahout | MLlib
Storm | Spark Streaming
Giraph | GraphX
HUE | Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and memory
Running tasks | Unix processes | Linux container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API.
• ETL with Spark: First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
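The "mix and match SQL with imperative code" workflow, shown here with Python's stdlib sqlite3 standing in for Spark SQL (same shape: a declarative query produces rows that regular code then post-processes; this is an analogy, not Spark's API):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ada", 10), ("bob", 32), ("ada", 5)])

# Declarative step: aggregate in SQL ...
rows = conn.execute(
    "SELECT user, SUM(bytes) FROM events GROUP BY user ORDER BY user").fetchall()

# ... imperative step: arbitrary post-processing of the result set in code.
heavy_users = [user for user, total in rows if total > 20]
assert heavy_users == ["bob"]
```

In Spark SQL the query would run over a SchemaRDD/DataFrame and the post-processing over RDD transformations, all inside one program.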
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
95
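The "mini batches" processing model is the key difference: Spark Streaming chops the stream into small slices and runs a batch job on each. The batching idea in plain Python (a toy sketch; real DStreams batch by wall-clock interval, not record count):

```python
def micro_batches(stream, batch_size):
    """Group a record stream into fixed-size batches, each processed as one unit."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

stream = iter(range(7))
assert list(micro_batches(stream, 3)) == [[0, 1, 2], [3, 4, 5], [6]]
```

Running an ordinary batch computation per slice is what gives Spark Streaming exactly-once semantics and reuse of the core Spark API, at the cost of a few seconds of latency versus Storm's record-at-a-time model.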
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Integration
bull Apache Kafka is a high throughput distributed
messaging system httpkafkaapacheorg
bull Spark Streaming integrates natively with Kafka
Spark Streaming + Kafka Integration Guidehttpsparkapacheorgdocslateststreaming-kafka-integrationhtml
bull Tutorial Integrating Kafka and Spark Streaming
Code Examples and State of the Game httpwwwmichael-nollcomblog20141001kafka-spark-streaming-integration-
example-tutorial
bull lsquoKafkarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag24-kafka
54
3 Integration
bull Apache Flume is a streaming event data
ingestion system that is designed for Big Data
ecosystem httpflumeapacheorg
bull Spark Streaming integrates natively with
Flume There are two approaches to this
bull Approach 1 Flume-style Push-based Approach
bull Approach 2 (Experimental) Pull-based
Approach using a Custom Sink
bull Spark Streaming + Flume Integration Guidehttpssparkapacheorgdocslateststreaming-flume-integrationhtml
55
3 Integration
bull Spark SQL provides built in support for JSON that
is vastly simplifying the end-to-end-experience of
working with JSON data
bull Spark SQL can automatically infer the schema
of a JSON dataset and load it as a
SchemaRDD No more DDL Just point Spark
SQL to JSON files and query Starting Spark 13
SchemaRDD will be renamed to DataFrame
bull An introduction to JSON support in Spark SQL February 2 2015 httpdatabrickscomblog20150202an-introduction-to-json-
support-in-spark-sqlhtml
56
3 Integration
bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows to
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
57
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
bull Problem
bull Various inbound data sets
bull Data Layout can change without notice
bull New data sets can be added without notice
Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
bull The Kite SDK provides high level abstractions to
work with datasets on Hadoop hiding many of
the details of compression codecs file formats
partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016
release so Spark jobs can read and write to Kite
datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
59
3 Integration
bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark in the form of RDD that can
read data from Elasticsearch Also any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html
60
3 Integration
bull Apache Solr added a Spark-based indexing tool for
fast and easy indexing ingestion and serving
searchable complex data ldquoCrunchIndexerTool on
Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014). http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015). http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN’s power for Hadoop-driven workloads and Mesos’ ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• ‘Myriad’ Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster. https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management. http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can’t We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need of a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more “stream oriented”, has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster’s memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the ‘Smart Execution Engine’ layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine. Interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the “Right” Execution Engine, by Peter Voss, September 30, 2014. http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group.
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015. http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015. http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015. http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn’t fit all.
72
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your ‘Big Data’ use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn’t need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark’s cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See “Because Hadoop isn’t perfect: 8 ways to replace HDFS”, July 11, 2012. https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015. http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL) (upcoming support). http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon’s S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015. https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud Announcement and Demo at Spark Summit 2014, July 2, 2014. https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014. http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014. http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• ‘Stratio’ Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• ‘xPatterns’ Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• ‘BlueData’ Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world’s largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark. http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• ‘Guavus’ Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
86
4. Alternatives
             Hadoop Ecosystem   Spark Ecosystem
Component:   HDFS               Tachyon
             YARN               Mesos
Tools:       Pig                Spark native API
             Hive               Spark SQL
             Mahout             MLlib
             Storm              Spark Streaming
             Giraph             GraphX
             HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS). https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center “OS”:
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• ‘Mesos’ Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria          YARN                  Mesos
Resource sharing  Yes                   Yes
Written in        Java                  C++
Scheduling        Memory only           CPU and Memory
Running tasks     Unix processes        Linux Container groups
Requests          Specific requests     More generic, but more coding
                  and locality          for writing frameworks
                  preference
Maturity          Less mature           Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• ‘Spark Core’ Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
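As a small, hedged illustration of the RDD programming style the native API encourages (plain Python here, not actual Spark; the input lines are made-up sample data, and a real job would start from `sc.textFile(...)` on a SparkContext), the classic word count has this shape:

```python
from collections import Counter

# Plain-Python imitation of the RDD pipeline shape
# flatMap -> map -> reduceByKey. In Spark the same pipeline is
# roughly: sc.textFile(path).flatMap(lambda l: l.split())
#            .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
lines = ["spark and hadoop", "spark without hadoop"]  # sample data

words = [w for line in lines for w in line.split()]   # flatMap
pairs = [(w, 1) for w in words]                       # map
counts = Counter()                                    # reduceByKey
for w, n in pairs:
    counts[w] += n

print(dict(counts))
```

The lambda-heavy, chained style is what the slide's Java 8 point is about: with lambdas, the Java version collapses to nearly the same length as the Scala one.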
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
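The "mix and match SQL and imperative APIs" pattern is not unique to Spark. As a rough, Spark-free sketch using Python's built-in sqlite3 (the table and data below are hypothetical; real Spark SQL code would register a DataFrame/SchemaRDD as a temp table and run queries through a SQLContext instead):

```python
import sqlite3

# Toy data standing in for a registered table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the SQL result in ordinary code.
top = {user: total for user, total in rows if total >= 5}
print(top)
```

The point of the unification is exactly this back-and-forth: a set-oriented SQL step feeding a procedural step, in one program.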
Spark MLlib
93
‘Spark MLlib’ Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
‘Spark Streaming’ Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                     Storm                  Spark Streaming
Processing model             Record at a time       Mini batches
Latency                      Sub-second             Few seconds
Fault tolerance (every       At least once          Exactly once
record processed)            (may be duplicates)
Batch framework              Not available          Core Spark API
integration
Supported languages          Any programming        Scala, Java, Python
                             language
95
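The "record at a time vs. mini batches" row can be made concrete with a few lines of plain Python. This is only an illustration of the batching idea (batching by count instead of by time, with made-up records), not Spark Streaming's DStream API:

```python
def mini_batches(stream, batch_size):
    """Group an incoming record stream into fixed-size mini batches,
    the way Spark Streaming slices a stream into one small RDD per
    batch interval (simplified here to count-based batches)."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # one "RDD" per interval
            batch = []
    if batch:
        yield batch              # final partial batch

records = ["e1", "e2", "e3", "e4", "e5"]   # hypothetical event stream
batches = list(mini_batches(records, 2))
print(batches)
```

A record-at-a-time system like Storm would instead hand e1…e5 to the topology individually, which is where the latency and fault-tolerance differences in the table come from.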
GraphX
96
‘GraphX’ Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3. Integration
• Apache Flume is a streaming event data ingestion system designed for the Big Data ecosystem. http://flume.apache.org
• Spark Streaming integrates natively with Flume. There are two approaches to this:
• Approach 1: Flume-style push-based approach
• Approach 2 (experimental): pull-based approach using a custom sink
• Spark Streaming + Flume Integration Guide: https://spark.apache.org/docs/latest/streaming-flume-integration.html
55
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL at JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015. http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
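To make "automatically infer the schema" concrete, here is a toy, Spark-free sketch of the idea in plain Python with made-up sample records. Spark SQL's real inference additionally merges conflicting types, handles nested structures, and runs distributed:

```python
import json

# Hypothetical input: one JSON document per line, the shape
# Spark SQL's JSON loader expects.
lines = [
    '{"name": "ann", "age": 34}',
    '{"name": "bob", "age": 29, "city": "LA"}',
]

def infer_schema(json_lines):
    """Union of field names -> type names across all records."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema[field] = type(value).__name__
    return schema

print(infer_schema(lines))
```

Note how the inferred schema is the union of fields seen anywhere in the dataset ("city" appears in only one record), which is what lets you query semi-structured JSON without writing DDL first.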
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrating example of integration of Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet
57
3. Integration
• Spark SQL Avro Library for querying Avro data with Spark SQL (requires Spark 1.2+). https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark Demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3 Integration
bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark in the form of RDD that can
read data from Elasticsearch Also any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html
60
3 Integration
bull Apache Solr added a Spark-based indexing tool for
fast and easy indexing ingestion and serving
searchable complex data ldquoCrunchIndexerTool on
Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN
cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache
Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-
resource-management
bull YARN vs MESOS Canrsquot We All Just Get
Along httpstrataconfcombig-data-conference-ca-
2015publicscheduledetail40620
67
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the
need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with
YARN (resource chaining in clusters)
bull Tez supports enterprise security
68
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS":
• Share the data center between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos

Criteria           YARN                       Mesos
Resource sharing   Yes                        Yes
Written in         Java                       C++
Scheduling         Memory only                CPU and memory
Running tasks      Unix processes             Linux container groups
Requests           Specific requests and      More generic, but more coding
                   locality preference        for writing frameworks
Maturity           Less mature                Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
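To give a taste of the native API's concision, here is the classic word count expressed in the same flatMap → map → reduceByKey shape as Spark's RDD chain. This is a stdlib-only sketch over a local list (the input lines are invented) so it runs without a cluster; the equivalent PySpark pipeline is shown in a comment.

```python
# A local stand-in for an RDD of text lines (hypothetical data).
lines = ["spark and hadoop", "spark without hadoop"]

# flatMap: split every line into words.
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1.
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word.
counts = {}
for w, n in pairs:
    counts[w] = counts.get(w, 0) + n

# The equivalent PySpark chain (needs a SparkContext, not run here):
# counts = sc.textFile("hdfs://...").flatMap(lambda l: l.split()) \
#            .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect()
```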
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming

Criteria                  Storm                    Spark Streaming
Processing model          Record at a time         Mini-batches
Latency                   Sub-second               Few seconds
Fault tolerance (every    At least once (may       Exactly once
record processed)         be duplicates)
Batch framework           Not available            Core Spark API
integration
Supported languages       Any programming          Scala, Java, Python
                          language
95
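The "mini-batches" row is the key difference: Spark Streaming chops the input stream into small time-sliced batches and runs an ordinary Spark job on each, rather than handling one record at a time like Storm. A rough stdlib-only sketch of that idea (the event timestamps and batch interval are invented for illustration):

```python
# Each event is (arrival_time_in_seconds, payload) - hypothetical data.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.5, "e")]
BATCH_INTERVAL = 1.0  # like Spark Streaming's batch duration, e.g. 1 second

# Assign each record to the mini-batch covering its arrival time...
batches = {}
for t, payload in events:
    batches.setdefault(int(t // BATCH_INTERVAL), []).append(payload)

# ...then process each mini-batch as one small, ordinary batch job.
batch_counts = {b: len(items) for b, items in sorted(batches.items())}
```

This batching is what gives Spark Streaming its few-seconds latency floor, and also what lets it reuse the core Spark API and provide exactly-once semantics per batch.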
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython. https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3. Integration
• Spark SQL provides built-in support for JSON that vastly simplifies the end-to-end experience of working with JSON data.
• Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD. No more DDL: just point Spark SQL to JSON files and query. Starting with Spark 1.3, SchemaRDD will be renamed to DataFrame.
• An introduction to JSON support in Spark SQL, February 2, 2015: http://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html
56
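Conceptually, Spark SQL's JSON support scans the records and unions the fields it sees to derive a schema; in Spark itself this is just `sqlContext.jsonFile(path)` followed by SQL. A toy stdlib sketch of the inference step (the sample records are invented):

```python
import json

# Two heterogeneous JSON records, one per line, as in a typical JSON dataset.
records = [
    '{"name": "alice", "age": 34}',
    '{"name": "bob", "city": "LA"}',
]

# Union the fields (and their value types) across all records,
# which is roughly how Spark SQL derives a SchemaRDD/DataFrame schema.
schema = {}
for line in records:
    for field, value in json.loads(line).items():
        schema[field] = type(value).__name__

# Spark SQL equivalent (needs a SQLContext, not run here):
#   people = sqlContext.jsonFile("people.json")
#   people.registerTempTable("people")
#   sqlContext.sql("SELECT name FROM people WHERE age > 30")
```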
3. Integration
• Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. http://parquet.incubator.apache.org
• Built-in support in Spark SQL allows you to:
• Import relational data from Parquet files
• Run SQL queries over imported data
• Easily write RDDs out to Parquet files: https://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files
• An illustrative example of integrating Parquet and Spark SQL: http://www.infoobjects.com/spark-sql-parquet/
57
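The point of a columnar format like Parquet is that a query touching one column can skip the bytes of all the others. A minimal illustration of pivoting row-oriented records into a column-oriented layout (the data is invented):

```python
# Row-oriented records, as an RDD or a row-major file would hold them.
rows = [
    {"user": "u1", "bytes": 100, "country": "US"},
    {"user": "u2", "bytes": 250, "country": "FR"},
    {"user": "u3", "bytes": 50,  "country": "US"},
]

# Pivot into one array per column - this is how Parquet lays data
# out on disk (plus encoding and compression per column).
columns = {k: [r[k] for r in rows] for k in rows[0]}

# A scan of a single column (e.g. SUM(bytes)) now reads only that
# array instead of deserializing every full row.
total_bytes = sum(columns["bytes"])
```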
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL (requires Spark 1.2+): https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro/
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
• Various inbound data sets
• Data layout can change without notice
• New data sets can be added without notice
• Result:
• Leverage Spark to dynamically split the data
• Leverage Avro to store the data in a compact binary format
58
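The "compact binary format" point can be seen with a toy comparison: a fixed schema lets a record be packed as raw binary with no repeated field names, unlike self-describing JSON text. (The record layout below is invented for illustration; real Avro also writes the schema once per file and uses its own encodings.)

```python
import json
import struct

record = {"user_id": 12345, "bytes": 6789}

# Self-describing JSON repeats every field name in every record.
as_json = json.dumps(record).encode("utf-8")

# With a known schema (here: two unsigned 32-bit ints), the same record
# packs into 8 fixed bytes; the schema lives alongside, written once.
as_binary = struct.pack(">II", record["user_id"], record["bytes"])

saving = len(as_json) - len(as_binary)
```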
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions to work with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc. http://kitesdk.org/docs/current/
• Spark support has been added in the Kite 0.16 release, so Spark jobs can read and write to Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine. http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch as long as its content can be translated into documents. https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark".
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrates ingestion of HDFS data into Solr from MapReduce to Spark
• Updates and deletes existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive. http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
Hadoop ecosystem Spark ecosystem
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• "Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services."
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer's Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
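Because Spark's input layer only cares about a path's URI scheme, swapping storage back-ends is, at the API level, just swapping the URI handed to `sc.textFile(...)`, provided the matching Hadoop-compatible file-system client is on the classpath. The bucket and host names below are placeholders:

```python
from urllib.parse import urlparse

# Hypothetical inputs - the same sc.textFile(path) call accepts any of them.
paths = [
    "hdfs://namenode:8020/logs/2015/03/",  # classic HDFS
    "s3n://my-bucket/logs/",               # Amazon S3
    "tachyon://master:19998/logs/",        # Tachyon in-memory FS
    "file:///data/logs/",                  # plain local files
]

# The scheme is what selects the storage implementation.
schemes = [urlparse(p).scheme for p in paths]
```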
1. File System
Coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
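In practice, switching among several of these deployment modes is mostly a matter of the `--master` URL handed to `spark-submit`. A sketch with Spark 1.x-era master values; the host names, ports, and `app.py` are placeholders:

```shell
# Local mode: all of Spark in one JVM, with 4 worker threads
spark-submit --master "local[4]" app.py

# Standalone cluster manager shipped with Spark
spark-submit --master spark://master-host:7077 app.py

# Apache Mesos
spark-submit --master mesos://mesos-master:5050 app.py

# Hadoop YARN (cluster location is read from the Hadoop config directory)
spark-submit --master yarn-cluster app.py
```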
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: from raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
3 Integration
bull Apache Parquet is a columnar storage formatavailable to any project in the Hadoop ecosystemregardless of the choice of data processingframework data model or programming languagehttpparquetincubatorapacheorg
bull Built in support in Spark SQL allows to
bull Import relational data from Parquet files
bull Run SQL queries over imported data
bull Easily write RDDs out to Parquet fileshttpsparkapacheorgdocslatestsql-programming-guidehtmlparquet-files
bull This is an illustrating example of integration ofParquet and Spark SQLhttpwwwinfoobjectscomspark-sql-parquet
57
3 Integration
bull Spark SQL Avro Library for querying Avro data with Spark SQL This library requires Spark 12+httpsgithubcomdatabricksspark-avro
bull This is an example of using Avro and Parquet in Spark SQLhttpwwwinfoobjectscomspark-with-avro
bull AvroSpark Use casehttpwwwslidesharenetDavidSmelkerbdbdug-data-types-jan-2015
bull Problem
bull Various inbound data sets
bull Data Layout can change without notice
bull New data sets can be added without notice
Result
bull Leverage Spark to dynamically split the data
bull Leverage Avro to store the data in a compact binary format
58
3 Integration Kite SDK
bull The Kite SDK provides high level abstractions to
work with datasets on Hadoop hiding many of
the details of compression codecs file formats
partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016
release so Spark jobs can read and write to Kite
datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
59
3 Integration
bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark in the form of RDD that can
read data from Elasticsearch Also any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html
60
3 Integration
bull Apache Solr added a Spark-based indexing tool for
fast and easy indexing ingestion and serving
searchable complex data ldquoCrunchIndexerTool on
Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN
cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache
Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-
resource-management
bull YARN vs MESOS Canrsquot We All Just Get
Along httpstrataconfcombig-data-conference-ca-
2015publicscheduledetail40620
67
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the
need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with
YARN (resource chaining in clusters)
bull Tez supports enterprise security
68
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System. Your ‘Big Data’ use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn’t need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark’s cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
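Spark’s storage-agnosticism comes down to the path URI: the scheme prefix selects the file-system connector, so the same processing code can target HDFS, S3, Tachyon, or local disk. A minimal plain-Python illustration of that dispatch (host names, ports, and buckets below are made up; this is not Spark code itself):

```python
from urllib.parse import urlparse

# In Spark, the storage backend is selected purely by the path's URI scheme;
# the processing code (e.g. sc.textFile(path)) is unchanged across backends.
# All paths below are illustrative placeholders.
paths = [
    "hdfs://namenode:8020/logs/2015/03/12",   # Hadoop HDFS
    "s3n://my-bucket/logs/2015/03/12",        # Amazon S3 (Hadoop s3n connector)
    "tachyon://master:19998/logs/cached",     # Tachyon in-memory file system
    "file:///tmp/logs/sample.txt",            # plain local file system
]

def storage_backend(path):
    """Return the URI scheme that decides which connector handles the path."""
    return urlparse(path).scheme

schemes = [storage_backend(p) for p in paths]
```

Swapping storage is then a one-character-class change to the path string, which is why “bring your own storage” is cheap in practice.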
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See “Because Hadoop isn’t perfect: 8 ways to replace HDFS”, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System – Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC Clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
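Which of these deployments a job runs on is chosen by the `--master` URL handed to spark-submit or spark-shell; the application code stays the same. A small sketch in Python (host names are placeholders; 7077 and 5050 are the usual standalone and Mesos default ports, and `yarn-cluster` is the Spark 1.x syntax):

```python
# Spark selects its cluster manager from the --master URL passed to
# spark-submit / spark-shell; the application itself is unchanged.
# Host names below are illustrative placeholders.
MASTERS = {
    "local":      "local[*]",                  # all cores on one machine
    "standalone": "spark://master-host:7077",  # Spark's own cluster manager
    "mesos":      "mesos://mesos-host:5050",   # Apache Mesos
    "yarn":       "yarn-cluster",              # Hadoop YARN (Spark 1.x syntax)
}

def spark_submit_cmd(app_jar, mode):
    """Build an illustrative spark-submit command line for a given mode."""
    return ["spark-submit", "--master", MASTERS[mode], app_jar]

cmd = spark_submit_cmd("my-app.jar", "mesos")
```

Moving the same application from a laptop to Mesos or YARN is therefore a launch-time flag, not a code change.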
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon S3 (most commonly), Redshift, and Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• ‘Stratio’ tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• ‘xPatterns’ tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• ‘BlueData’ tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world’s largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• ‘Guavus’ tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
86
4. Alternatives
            Hadoop Ecosystem | Spark Ecosystem
Component:  HDFS             | Tachyon
            YARN             | Mesos
Tools:      Pig              | Spark native API
            Hive             | Spark SQL
            Mahout           | MLlib
            Storm            | Spark Streaming
            Giraph           | GraphX
            HUE              | Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center “OS”:
  • Share a datacenter between multiple cluster-computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• ‘Mesos’ tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos
Criteria         | YARN                       | Mesos
Resource sharing | Yes                        | Yes
Written in       | Java                       | C++
Scheduling       | Memory only                | CPU and memory
Running tasks    | Unix processes             | Linux container groups
Requests         | Specific requests and      | More generic, but more coding
                 | locality preference        | for writing frameworks
Maturity         | Less mature                | Relatively more mature
90
Spark Native API
• Spark’s native API is available in Scala, Java, and Python.
• Interactive shells are available in Scala and Python.
• Spark supports Java 8, whose lambda expressions allow much more concise code, nearly as simple as the Scala API.
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• ‘Spark Core’ tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
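The conciseness of the native API comes from chaining small functional transformations. As a rough plain-Python analogue (not actual Spark code), the classic RDD word count (flatMap, then map to (word, 1) pairs, then reduceByKey) looks like:

```python
from collections import Counter
from itertools import chain

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap: split each line into words, flattening the result
words = chain.from_iterable(line.split() for line in lines)

# map + reduceByKey: pair each word with 1, then sum counts per key;
# Counter collapses both steps in plain Python
counts = Counter(words)
```

In Spark the same pipeline distributes across a cluster, but the shape of the code (and the lambda-heavy style Java 8 enables) is the same.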
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
92
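The “mix and match” point can be made concrete with stdlib sqlite3: a declarative SQL aggregation followed by imperative post-processing of the result, which is the same pattern Spark SQL offers at cluster scale. The table and values here are invented for illustration:

```python
import sqlite3

# Toy in-memory table standing in for a schema-bearing data source
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])

# Declarative step: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the SQL result in ordinary code
top_users = [user for user, total in rows if total >= 5]
```

Spark SQL lets the two styles interleave freely over distributed data, with the SQL side additionally understanding Hive formats, UDFs, and the metastore.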
Spark MLlib
93
‘Spark MLlib’ tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
‘Spark Streaming’ tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming
Criteria                     | Storm                    | Spark Streaming
Processing model             | Record at a time         | Mini-batches
Latency                      | Sub-second               | Few seconds
Fault tolerance –            | At least once            | Exactly once
every record processed       | (may be duplicates)      |
Batch framework integration  | Not available            | Core Spark API
Supported languages          | Any programming language | Scala, Java, Python
95
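The first row of the table is the core difference: Storm hands each record to the topology as it arrives, while Spark Streaming slices the stream into small batches, each processed as an RDD with the normal batch API. A toy plain-Python sketch of the mini-batch idea (a fixed batch size stands in for Spark Streaming's time interval):

```python
def mini_batches(stream, batch_size):
    """Group an incoming record stream into fixed-size mini-batches,
    the way Spark Streaming slices a stream into per-interval RDDs."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # one "RDD" per interval
            batch = []
    if batch:                    # flush the final partial batch
        yield batch

# Each batch would then be processed with the ordinary (batch) Spark API,
# which is what makes "batch framework integration" come for free.
batches = list(mini_batches(range(7), batch_size=3))
```

Batching is also what enables exactly-once semantics (a lost batch is simply recomputed), at the cost of the few-seconds latency shown in the table.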
GraphX
96
‘GraphX’ tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage!
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3. Integration
• Spark SQL Avro Library, for querying Avro data with Spark SQL. This library requires Spark 1.2+: https://github.com/databricks/spark-avro
• An example of using Avro and Parquet in Spark SQL: http://www.infoobjects.com/spark-with-avro
• Avro/Spark use case: http://www.slideshare.net/DavidSmelker/bdbdug-data-types-jan-2015
• Problem:
  • Various inbound data sets
  • Data layout can change without notice
  • New data sets can be added without notice
• Result:
  • Leverage Spark to dynamically split the data
  • Leverage Avro to store the data in a compact binary format
58
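The “dynamically split the data” step amounts to routing each inbound record by its (possibly changing) field layout, so that each group can then be written out with a matching Avro schema. A plain-Python sketch of just the routing step, with invented record shapes:

```python
from collections import defaultdict

# Inbound records whose layout can vary without notice (invented examples)
records = [
    {"type": "click", "url": "/home"},
    {"type": "order", "sku": "A-1", "qty": 2},
    {"type": "click", "url": "/cart"},
]

def split_by_schema(records):
    """Group records by their field layout; each group can then be
    written to its own Avro file with a matching schema."""
    groups = defaultdict(list)
    for rec in records:
        schema_key = tuple(sorted(rec))   # the field names define the layout
        groups[schema_key].append(rec)
    return dict(groups)

groups = split_by_schema(records)
```

In the actual use case, Spark performs this grouping in parallel and Avro provides the compact, self-describing binary container per group.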
3. Integration: Kite SDK
• The Kite SDK provides high-level abstractions for working with datasets on Hadoop, hiding many of the details of compression codecs, file formats, partitioning strategies, etc.: http://kitesdk.org/docs/current
• Spark support was added in the Kite 0.16 release, so Spark jobs can read and write Kite datasets.
• Kite Java Spark demo: https://github.com/kite-sdk/kite-examples/tree/master/spark
59
3. Integration
• Elasticsearch is a real-time distributed search and analytics engine: http://www.elasticsearch.org
• Apache Spark support in Elasticsearch was added in 2.1: http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/master/spark.html
• Deep-Spark also provides an integration with Spark: https://github.com/Stratio/deep-spark
• elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark in the form of an RDD that can read data from Elasticsearch. Also, any RDD can be saved to Elasticsearch, as long as its content can be translated into documents: https://github.com/elastic/elasticsearch-hadoop
• Great use case by NTT Data integrating Apache Spark Streaming and Elasticsearch: http://www.intellilink.co.jp/article/column/bigdata-kk02.html
60
3. Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: “CrunchIndexerTool on Spark”.
• A Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
  • Migrate ingestion of HDFS data into Solr from MapReduce to Spark.
  • Update and delete existing documents in Solr at scale.
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3. Integration
• HUE is the open source Apache Hadoop web UI that lets users work with Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark called Spark Igniter lets users execute and monitor Spark jobs directly from their browser and be more productive.
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data web applications for interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways
63
4. Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.
64
[Diagram: Hadoop ecosystem and Spark ecosystem components]
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• “Big data developers get the best of YARN’s power for Hadoop-driven workloads and Mesos’ ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.”
• Project Myriad is an open source framework for running YARN on Mesos.
• ‘Myriad’ tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can’t We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more “stream oriented”, has a more mature shuffling implementation, and closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster’s memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
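That rule of thumb, cache when the working set fits in cluster memory and prefer a stream-oriented engine when it does not, can be written down directly. A toy Python sketch of the decision (the strategy names and simple threshold are illustrative, not a real scheduler):

```python
def pick_strategy(data_gb, cluster_ram_gb):
    """Illustrative rule of thumb from the slide above: favor in-memory
    caching (Spark-style) when the data fits in cluster RAM, and a more
    stream-oriented engine (Tez-style) when it does not."""
    if data_gb < cluster_ram_gb:
        return "cache-in-memory"   # Data << RAM: Spark shines
    return "stream-oriented"       # Data >> RAM: Tez-style processing

choice_small = pick_strategy(data_gb=50, cluster_ram_gb=512)
choice_big = pick_strategy(data_gb=10_000, cluster_ram_gb=512)
```

Real deployments also weigh shuffle maturity, YARN integration, and job duration, but the memory-to-data ratio is the first-order factor the slide highlights.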
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
3 Integration Kite SDK
bull The Kite SDK provides high level abstractions to
work with datasets on Hadoop hiding many of
the details of compression codecs file formats
partitioning strategies etchttpkitesdkorgdocscurrent
bull Spark support has been added to Kite 016
release so Spark jobs can read and write to Kite
datasets
bull Kite Java Spark Demohttpsgithubcomkite-sdkkite-examplestreemasterspark
59
3 Integration
bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark in the form of RDD that can
read data from Elasticsearch Also any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html
60
3 Integration
bull Apache Solr added a Spark-based indexing tool for
fast and easy indexing ingestion and serving
searchable complex data ldquoCrunchIndexerTool on
Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4. Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez takes care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer can be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration/
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
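The Data << RAM case comes down to caching: parse the data once, keep the materialized result in memory, and run every subsequent pass against that copy instead of re-reading storage. A stdlib-only sketch of the idea (counting parse calls stands in for I/O cost; names are illustrative, not Spark's API):

```python
# Why Spark wins when data << RAM: cache the parsed records once, then
# run many passes over the in-memory copy instead of re-parsing from
# storage on every pass (as a MapReduce-style pipeline would).
parse_calls = 0

def parse(line):
    global parse_calls
    parse_calls += 1                 # stands in for disk read + parse cost
    user, clicks = line.split(",")
    return user, int(clicks)

raw = ["a,3", "b,7", "a,2"]          # stands in for files on disk

# Without caching: every pass re-parses the raw data.
total1 = sum(c for _, c in map(parse, raw))
total2 = sum(c for _, c in map(parse, raw))
uncached_calls = parse_calls

# With caching (the role of rdd.cache()): parse once, reuse the result.
parse_calls = 0
cached = [parse(line) for line in raw]
total3 = sum(c for _, c in cached)
total4 = sum(c for _, c in cached)
cached_calls = parse_calls
```

Two passes without caching cost twice the parse work; with the cached copy the second pass is free, which is exactly the win iterative workloads see.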
69
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption (February 12, 2015): http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms (February 23, 2015): http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop (March 9, 2015): http://blog.syncsort.com/2015/03/framework-future-hadoop/
71
5. Key Takeaways
1. Evolution: the evolution of compute models is still ongoing. Watch out for the Apache Flink project for true low-latency and iterative use cases.
2. Transition: tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: there is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS, the Hadoop Distributed File System; your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (Object Store): https://spark.apache.org/docs/latest/storage-openstack-swift.html and https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
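What makes the options above interchangeable is that Spark resolves the storage backend from the URI scheme of the path, so application code does not change when the file system does. A stdlib sketch of that dispatch (the scheme-to-backend table mirrors the list above; `backend_for` is a hypothetical helper, not Spark's API):

```python
from urllib.parse import urlparse

# Storage backends from the list above; Spark picks the Hadoop FileSystem
# implementation from the URI scheme, so application code is unchanged.
BACKENDS = {
    "hdfs":    "Hadoop Distributed File System",
    "s3n":     "Amazon S3",
    "maprfs":  "MapR-FS",
    "swift":   "OpenStack Swift",
    "tachyon": "Tachyon in-memory FS",
    "file":    "Local file system",
}

def backend_for(path):
    """Return the storage backend a path would resolve to."""
    scheme = urlparse(path).scheme or "file"   # bare paths -> local FS
    if scheme not in BACKENDS:
        raise ValueError("no file system registered for scheme %r" % scheme)
    return BACKENDS[scheme]
```

So `sc.textFile("s3n://bucket/logs")` and `sc.textFile("tachyon://master:19998/logs")` differ only in the path string.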
74
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS" (July 11, 2012): https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage (March 9, 2015): http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
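In practice, each deployment mode above corresponds to a different "master" URL handed to Spark at startup; the application code is otherwise identical. A sketch that classifies the master URL forms documented for Spark 1.x (`deployment_mode` is a hypothetical helper; `yarn-client`/`yarn-cluster` were the YARN master strings of that era):

```python
import re

# Master URL forms from the Spark 1.x documentation, one per cluster manager.
MASTER_PATTERNS = {
    "local":      re.compile(r"^local(\[(\d+|\*)\])?$"),   # local, local[4], local[*]
    "standalone": re.compile(r"^spark://[\w.-]+:\d+$"),    # spark://host:7077
    "mesos":      re.compile(r"^mesos://[\w.-]+:\d+$"),    # mesos://host:5050
    "yarn":       re.compile(r"^yarn-(client|cluster)$"),  # yarn-client / yarn-cluster
}

def deployment_mode(master):
    """Classify a Spark master URL by cluster manager."""
    for mode, pattern in MASTER_PATTERNS.items():
        if pattern.match(master):
            return mode
    raise ValueError("unrecognized master URL: %r" % master)
```

Switching a job from a laptop (`local[*]`) to a standalone or Mesos cluster is then a one-string change, which is what "agnostic to the underlying clustering infrastructure" means concretely.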
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant (March 4, 2015): https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014 (July 2, 2014): https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, by Piotr Kolaczkowski (September 26, 2014): http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, by Helena Edelson (November 24, 2014): http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos (September 25, 2014, by Eric Carr): http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives

           Hadoop ecosystem   Spark ecosystem
Component  HDFS               Tachyon
           YARN               Mesos
Tools      Pig                Spark native API
           Hive               Spark SQL
           Mahout             MLlib
           Storm              Spark Streaming
           Giraph             GraphX
           HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos

Criteria          YARN                      Mesos
Resource sharing  Yes                       Yes
Written in        Java                      C++
Scheduling        Memory only               CPU and memory
Running tasks     Unix processes            Linux container groups
Requests          Specific requests and     More generic, but more coding
                  locality preference       for writing frameworks
Maturity          Less mature               Relatively more mature
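The "more generic, but more coding" row reflects Mesos' two-level scheduling: the Mesos master makes resource offers, and each framework decides which offers to accept for its pending tasks. A toy sketch of one offer round (all names, the round-robin offer policy, and the data shapes are illustrative simplifications, not the Mesos API):

```python
# Toy model of Mesos two-level scheduling: the master offers resources,
# and each framework accepts only what its pending tasks need.
def run_offer_round(offers, frameworks):
    """offers: list of (node, cpus, mem_gb).
    frameworks: dict name -> list of pending (cpus, mem_gb) task demands.
    Returns the task placements made this round."""
    placements = []
    names = list(frameworks)
    for i, (node, cpus, mem) in enumerate(offers):
        fw = names[i % len(names)]           # hand offers out round-robin
        pending = frameworks[fw]
        if pending and pending[0][0] <= cpus and pending[0][1] <= mem:
            task = pending.pop(0)            # framework accepts the offer
            placements.append((fw, node, task))
        # otherwise the framework declines and the offer is re-made later
    return placements

frameworks = {"spark": [(2, 4)], "webapp": [(1, 1)]}
placements = run_offer_round([("n1", 4, 8), ("n2", 1, 2)], frameworks)
```

The accept/decline decision living in the framework is what lets Mesos host non-Hadoop workloads alongside Spark, at the cost of each framework writing that scheduling logic itself.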
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup (May 28, 2014): http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
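The conciseness the slide describes comes from the functional map/reduce style all three language APIs share. Here is the classic word count in that style, using only the Python standard library as a stand-in for an RDD chain (with the real API the same pipeline would be `flatMap`, `map`, `reduceByKey` on a distributed dataset):

```python
from functools import reduce
from itertools import chain

lines = ["spark or hadoop", "spark and hadoop"]

# flatMap -> map -> reduceByKey, expressed with stdlib equivalents
words = chain.from_iterable(line.split() for line in lines)     # flatMap
pairs = ((w, 1) for w in words)                                 # map
counts = reduce(                                                # reduceByKey
    lambda acc, kv: {**acc, kv[0]: acc.get(kv[0], 0) + kv[1]},
    pairs,
    {},
)
```

Java 8 lambdas let the Java API read almost exactly like this, where pre-Java-8 code needed an anonymous inner class per transformation.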
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming

Criteria                 Storm                   Spark Streaming
Processing model         Record at a time        Mini-batches
Latency                  Sub-second              Few seconds
Fault tolerance (every   At least once (may be   Exactly once
record processed)        duplicates)
Batch framework          Not available           Core Spark API
integration
Supported languages      Any programming         Scala, Java, Python
                         language
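The processing-model row is the crux of the comparison: Storm handles each record as it arrives, while Spark Streaming discretizes the stream into mini-batches and runs an ordinary batch computation per interval, which is why its latency is "few seconds" but it reuses the core Spark API. A stdlib sketch of the mini-batch (DStream-style) model, with illustrative names and a made-up interval:

```python
# Mini-batch processing: chop the stream into fixed intervals, then run
# an ordinary batch computation on each batch (the DStream model).
def to_mini_batches(stream, batch_interval):
    """stream: list of (timestamp, value) events; returns batches keyed
    by interval start, mimicking how a DStream discretizes a live feed."""
    batches = {}
    for ts, value in stream:
        start = (ts // batch_interval) * batch_interval
        batches.setdefault(start, []).append(value)
    return batches

stream = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.9, "d")]
batches = to_mini_batches(stream, batch_interval=1.0)

# Any batch operation can now run per interval -- here, a simple count.
per_batch_counts = {start: len(vals) for start, vals in batches.items()}
```

A record-at-a-time system like Storm would instead invoke processing once per event with no batching step, trading the batch API reuse for sub-second latency.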
95
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics and has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research the pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Integration
bull Elasticsearch is a real-time distributed search and analytics
engine httpwwwelasticsearchorg
bull Apache Spark Support in Elasticsearch was added in 21httpwwwelasticsearchorgguideenelasticsearchhadoopmastersparkhtml
bull Deep-Spark provides also an integration with SparkhttpsgithubcomStratiodeep-spark
bull elasticsearch-hadoop provides native integration between
Elasticsearch and Apache Spark in the form of RDD that can
read data from Elasticsearch Also any RDD can be saved to
Elasticsearch as long as its content can be translated into
documents httpsgithubcomelasticelasticsearch-hadoop
bull Great use case by NTT Data integrating Apache
Spark Streaming and Elasticsearchhttpwwwintellilinkcojparticlecolumnbigdata-kk02html
60
3 Integration
bull Apache Solr added a Spark-based indexing tool for
fast and easy indexing ingestion and serving
searchable complex data ldquoCrunchIndexerTool on
Sparkrdquo
bull Solr-on-Spark solution using Apache Solr Spark
Crunch and Morphlines
bull Migrate ingestion of HDFS data into Solr from
MapReduce to Spark
bull Update and delete existing documents in Solr at scale
bull Ingesting HDFS data into Solr using Sparkhttpwwwslidesharenetwhoschekingesting-hdfs-
intosolrusingsparktrimmed
61
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN
cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache
Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-
resource-management
bull YARN vs MESOS Canrsquot We All Just Get
Along httpstrataconfcombig-data-conference-ca-
2015publicscheduledetail40620
67
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the
need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with
YARN (resource chaining in clusters)
bull Tez supports enterprise security
68
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
3 Integration
• Apache Solr added a Spark-based indexing tool for fast and easy indexing, ingestion, and serving of searchable complex data: "CrunchIndexerTool on Spark"
• Solr-on-Spark solution using Apache Solr, Spark, Crunch, and Morphlines:
• Migrate ingestion of HDFS data into Solr from MapReduce to Spark
• Update and delete existing documents in Solr at scale
• Ingesting HDFS data into Solr using Spark: http://www.slideshare.net/whoschek/ingesting-hdfs-intosolrusingsparktrimmed
61
3 Integration
• HUE is the open source Apache Hadoop Web UI that lets users use Hadoop directly from their browser and be productive: http://www.gethue.com
• A Hue application for Apache Spark, called Spark Igniter, lets users execute and monitor Spark jobs directly from their browser and be more productive
• Demo of Spark Igniter: http://vimeo.com/83192197
• Big Data Web applications for Interactive Hadoop: https://speakerdeck.com/bigdataspain/big-data-web-applications-for-interactive-hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of the Hadoop ecosystem and the Spark ecosystem
can work together, each for what it is especially good at,
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity +
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services
• Project Myriad is an open source framework for running YARN on Mesos
• 'Myriad' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity +
References:
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching)
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters)
• Tez supports enterprise security
68
4 Complementarity +
• Data >> RAM: when processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration
• Data << RAM: since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
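The Data >> RAM vs Data << RAM tradeoff can be made concrete with a plain-Python sketch (not Spark or Tez code; all names are illustrative): when the working set fits in memory, caching the parsed records once saves every later pass the cost of re-parsing, which is essentially what Spark's `rdd.cache()` exploits.

```python
# Plain-Python sketch (not Spark): why caching parsed data helps when the
# working set fits in memory. Counts parse calls instead of wall-clock time.
import json

raw_lines = [json.dumps({"v": i}) for i in range(1000)]
parse_calls = 0

def parse(line):
    global parse_calls
    parse_calls += 1
    return json.loads(line)

# Without caching: every pass over the data re-parses it (disk-oriented
# pipelines similarly re-read input between stages).
for _ in range(3):
    total = sum(rec["v"] for rec in (parse(l) for l in raw_lines))
uncached_calls = parse_calls

# With caching: parse once, keep records in memory (the rdd.cache() idea).
parse_calls = 0
cached = [parse(l) for l in raw_lines]
for _ in range(3):
    total = sum(rec["v"] for rec in cached)
cached_calls = parse_calls

print(uncached_calls, cached_calls)  # 3000 1000
```

Three passes without caching cost 3000 parses; with caching, only the initial 1000. The advantage disappears once `cached` no longer fits in RAM, which is the Data >> RAM case above.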
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015 at the Los Angeles Big Data Users Group
• http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 Use OpenStack Swift (Object Store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
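In practice, this file-system agnosticism amounts to dispatching on the URI scheme of the input path (hdfs://, s3n://, tachyon://, file://, …). A minimal plain-Python sketch of that dispatch pattern; the backend table and names are hypothetical illustrations, not Spark internals:

```python
# Sketch of URI-scheme dispatch, the pattern behind "bring your own storage".
# The BACKENDS table is illustrative, not Spark's actual registry.
from urllib.parse import urlparse

BACKENDS = {
    "file": "LocalFS",      # plain local files
    "hdfs": "HDFS",         # Hadoop Distributed File System
    "s3n": "S3",            # Amazon S3 (Spark 1.x-era scheme)
    "tachyon": "Tachyon",   # in-memory file system
}

def storage_backend(uri):
    """Pick a storage backend from the URI scheme; no scheme means local."""
    scheme = urlparse(uri).scheme or "file"
    if scheme not in BACKENDS:
        raise ValueError("unsupported scheme: " + scheme)
    return BACKENDS[scheme]

print(storage_backend("s3n://bucket/logs/2015.txt"))  # S3
print(storage_backend("/tmp/events.log"))             # LocalFS
```

The same application code then works unchanged whether the path points at HDFS, S3, Tachyon, or a local disk, which is why none of the use cases above depend on HDFS specifically.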
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
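From the application's side, these deployments differ mainly in the --master URL handed to spark-submit. A sketch based on the Spark 1.x documentation; host names and ports are placeholders:

```
# Local mode, 4 worker threads on one machine
spark-submit --master local[4] my_app.py

# Standalone Spark cluster
spark-submit --master spark://master-host:7077 my_app.py

# Apache Mesos
spark-submit --master mesos://mesos-host:5050 my_app.py

# Hadoop YARN (Spark 1.x flag)
spark-submit --master yarn-cluster my_app.py
```

The application code itself stays the same across all four; only the cluster manager behind the master URL changes.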
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution:
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives
Hadoop Ecosystem → Spark Ecosystem
Components:
• HDFS → Tachyon
• YARN → Mesos
Tools:
• Pig → Spark native API
• Hive → Spark SQL
• Mahout → MLlib
• Storm → Spark Streaming
• Giraph → GraphX
• HUE → Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs
• Mesos as the data center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos
Criteria | YARN | Mesos
Resource sharing | Yes | Yes
Written in | Java | C++
Scheduling | Memory only | CPU and Memory
Running tasks | Unix processes | Linux Container groups
Requests | Specific requests and locality preference | More generic, but more coding for writing frameworks
Maturity | Less mature | Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics
92
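The "mix and match SQL and imperative APIs" idea is easy to illustrate outside Spark. A stdlib sqlite3 sketch (plain Python, not Spark SQL) of declaring the heavy aggregation in SQL and post-processing the result imperatively:

```python
# Plain-Python illustration (sqlite3, not Spark SQL) of mixing a declarative
# SQL step with imperative post-processing, the pattern Spark SQL unifies.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10.0), ("a", 5.0), ("b", 7.5)])

# Declarative step: aggregation pushed into the SQL engine.
rows = conn.execute(
    "SELECT user, SUM(amount) AS total FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: arbitrary Python logic over the query result
# (here, a hypothetical 10% surcharge).
report = {user: round(total * 1.1, 2) for user, total in rows}
print(report)  # {'a': 16.5, 'b': 8.25}
```

In Spark SQL the two steps run in one engine over distributed data, so the planner can optimize across the SQL/imperative boundary instead of materializing an intermediate result.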
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming
Criteria | Storm | Spark Streaming
Processing model | Record at a time | Mini batches
Latency | Sub-second | Few seconds
Fault tolerance (every record processed) | At least once (may be duplicates) | Exactly once
Batch framework integration | Not available | Core Spark API
Supported languages | Any programming language | Scala, Java, Python
95
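The "record at a time" vs "mini batches" row is the crux of the table. A plain-Python sketch (not Storm or Spark code) of the mini-batch model: the stream is chopped into small batches, and each batch is processed with ordinary batch logic, which is why Spark Streaming can reuse the core Spark API:

```python
# Plain-Python sketch of mini-batch stream processing (the Spark Streaming
# model): buffer incoming records, then run a batch computation per window.
from itertools import islice

def mini_batches(stream, batch_size):
    """Chop a (possibly unbounded) record stream into fixed-size mini batches."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def word_count(batch):
    # Ordinary batch logic, applied once per mini batch.
    counts = {}
    for word in batch:
        counts[word] = counts.get(word, 0) + 1
    return counts

stream = ["spark", "storm", "spark", "flink", "spark", "storm", "tez"]
results = [word_count(b) for b in mini_batches(stream, 3)]
print(results)
# [{'spark': 2, 'storm': 1}, {'flink': 1, 'spark': 1, 'storm': 1}, {'tez': 1}]
```

The batch boundary is also what buys the "exactly once" row: a failed batch can be recomputed as a unit, whereas per-record acking (Storm's model) yields at-least-once delivery with possible duplicates.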
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
3 Integration
bull HUE is the open source Apache Hadoop Web UI
that lets users use Hadoop directly from their
browser and be productive httpwwwgethuecom
bull A Hue application for Apache Spark called Spark
Igniter lets users execute and monitor Spark jobs
directly from their browser and be more
productive
bull Demo of Spark Igniter httpvimeocom83192197
bull Big Data Web applications for Interactive Hadoophttpsspeakerdeckcombigdataspainbig-data-web-applications-for-interactive-
hadoop-by-enrico-berti-at-big-data-spain-2014
62
III Spark with Hadoop
1 Evolution
2 Transition
3 Integration
4 Complementarity
5 Key Takeaways
63
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN
cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache
Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-
resource-management
bull YARN vs MESOS Canrsquot We All Just Get
Along httpstrataconfcombig-data-conference-ca-
2015publicscheduledetail40620
67
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the
need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with
YARN (resource chaining in clusters)
bull Tez supports enterprise security
68
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
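That mix-and-match point can be made concrete without assuming a Spark installation. The sketch below uses Python's built-in sqlite3 as a stand-in SQL engine to show the pattern Spark SQL enables: a declarative SQL step followed by imperative post-processing in ordinary code (the table and data here are hypothetical):

```python
import sqlite3

# Spark SQL's appeal is mixing declarative SQL with imperative code in
# one program. Spark is not assumed here, so this sketch uses the
# stdlib sqlite3 engine; in Spark the analogous step is sqlContext.sql(...)
# returning a result you keep transforming in code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 5), ("ann", 2)])

# Declarative step: aggregate with SQL
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the result in ordinary code
totals = {user: total for user, total in rows}
print(totals)  # {'ann': 5, 'bob': 5}
```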
Spark MLlib

93

'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib

Spark Streaming

94

'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming

Criteria                     Storm                     Spark Streaming
Processing model             Record at a time          Mini-batches
Latency                      Sub-second                Few seconds
Fault tolerance (every       At least once (may be     Exactly once
record processed)            duplicates)
Batch framework integration  Not available             Core Spark API
Supported languages          Any programming language  Scala, Java, Python
95
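The "record at a time" vs. "mini-batches" row is the key difference. A framework-free sketch (plain Python, no Storm or Spark required) of how discretizing a stream into fixed-size mini-batches works:

```python
# Spark Streaming discretizes a stream into mini-batches (DStreams) and
# runs a batch job per interval, whereas Storm hands each record to the
# topology as it arrives. A framework-free sketch of the mini-batch idea:
def mini_batches(stream, batch_size):
    """Group an incoming record stream into fixed-size mini-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch          # one "batch interval" worth of records
            batch = []
    if batch:
        yield batch              # final partial batch

stream = range(7)
batches = list(mini_batches(stream, batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Spark Streaming batches by time interval rather than by record count, but the discretization idea, and the resulting latency of "a few seconds", is the same.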
GraphX

96

'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook

97

• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup, or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython. https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.

99
V. More Q&A

100

http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi
III. Spark with Hadoop
1. Evolution
2. Transition
3. Integration
4. Complementarity
5. Key Takeaways

63
4. Complementarity

Components of the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at, rather than choosing one of them.

64

4. Complementarity: Hadoop ecosystem + Spark ecosystem
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14, 2014): http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack (January 2015): http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4. Complementarity: Mesos + YARN

• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big Data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4. Complementarity: Mesos + YARN references

• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: A Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4. Complementarity: Spark + Tez

• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4. Complementarity: Spark + Tez

• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
4. Complementarity

• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity

• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

73
1. File System

Spark does not require HDFS (Hadoop Distributed File System). Your Big Data use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
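As a concrete illustration of point 4 above, pointing Spark at S3 instead of HDFS typically required little more than filesystem credentials plus an s3n:// path. A hedged sketch (the bucket name is hypothetical; the property names come from the Hadoop s3n connector of that era):

```
# conf/spark-defaults.conf: credentials for the Hadoop s3n:// connector
spark.hadoop.fs.s3n.awsAccessKeyId      YOUR_ACCESS_KEY
spark.hadoop.fs.s3n.awsSecretAccessKey  YOUR_SECRET_KEY
```

With that in place, an RDD can be created straight from the bucket, e.g. sc.textFile("s3n://my-bucket/logs/"), with no HDFS in the picture.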
1. File System

When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs

A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

76
2. Deployment

While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

78
3. Distributions

• Using Spark on a non-Hadoop distribution:
79
Databricks Cloud

• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce. https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE

• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in Cassandra File System. http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, published on November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
Stratio

• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready. http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine. http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months. https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) Embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways

86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
4 Complementarity
Components of Hadoop ecosystem and Spark ecosystem
can work together each for what it is especially good at
rather than choosing one of them
64
Hadoop ecosystem Spark ecosystem
4 Complementarity + +
bull Tachyon is an in-memory distributed file system By storing the file-system contents in the main memory of all cluster nodes the system achieves higher throughput than traditional disk-based storage systems like HDFS
bull The Future Architecture of a Data Lake In-memory Data Exchange Platform Using Tachyon and Apache Spark (October 14 2014)httpblogpivotaliobig-data-pivotalnews-2the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
bull Spark and in-memory databases Tachyon leading the pack January 2015httpdynresmanagementcom1post201501spark-and-in-memory-databases-tachyon-leading-the-packhtml
65
4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN
cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache
Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-
resource-management
bull YARN vs MESOS Canrsquot We All Just Get
Along httpstrataconfcombig-data-conference-ca-
2015publicscheduledetail40620
67
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the
need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with
YARN (resource chaining in clusters)
bull Tez supports enterprise security
68
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL with more imperative programming APIs for advanced analytics.
92
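The "mix SQL with imperative code" pattern is easy to picture with any SQL engine. Here is a minimal stand-in using Python's built-in sqlite3 (an assumption for illustration, not Spark SQL itself): an aggregate is computed declaratively in SQL, then post-processed in ordinary code, which is exactly the workflow Spark SQL encourages over DataFrames.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10.0), ("b", 5.0), ("a", 7.5)])

# Declarative step: aggregate in SQL
rows = conn.execute(
    "SELECT user, SUM(amount) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the SQL result with ordinary code
big_spenders = [user for user, total in rows if total > 9]
```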
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming

Criteria                                  | Storm                             | Spark Streaming
------------------------------------------|-----------------------------------|------------------------
Processing model                          | Record at a time                  | Mini batches
Latency                                   | Sub-second                        | Few seconds
Fault tolerance (every record processed)  | At least once (may be duplicates) | Exactly once
Batch framework integration               | Not available                     | Core Spark API
Supported languages                       | Any programming language          | Scala, Java, Python
95
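The "mini batches" row is the key difference: Spark Streaming chops the input stream into small batches and runs ordinary batch logic on each, while Storm processes every record as it arrives. A bare-bones sketch of micro-batching (illustrative only, batching by count rather than by time interval as Spark Streaming actually does):

```python
def mini_batches(stream, batch_size):
    """Group an unbounded record stream into fixed-size micro-batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:          # flush the trailing partial batch
        yield batch

# Each micro-batch is then processed with ordinary batch logic:
batches = list(mini_batches(range(7), batch_size=3))
per_batch_sums = [sum(b) for b in batches]
```

Batching is what buys Spark Streaming its exactly-once semantics and Core Spark API reuse, at the cost of a few seconds of latency.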
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file-system agnostic. Bring your own storage.
2 Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
4 Complementarity
• Tachyon is an in-memory distributed file system. By storing the file-system contents in the main memory of all cluster nodes, the system achieves higher throughput than traditional disk-based storage systems like HDFS.
• The Future Architecture of a Data Lake: In-memory Data Exchange Platform Using Tachyon and Apache Spark, October 14, 2014: http://blog.pivotal.io/big-data-pivotal/news-2/the-future-architecture-of-a-data-lake-in-memory-data-exchange-platform-using-tachyon-and-apache-spark
• Spark and in-memory databases: Tachyon leading the pack, January 2015: http://dynresmanagement.com/1/post/2015/01/spark-and-in-memory-databases-tachyon-leading-the-pack.html
65
4 Complementarity
• Mesos and YARN can work together, each for what it is especially good at, rather than choosing one of the two for Spark deployment.
• Big data developers get the best of YARN's power for Hadoop-driven workloads and Mesos' ability to run any other kind of workload, including non-Hadoop applications like Web applications and other long-running services.
• Project Myriad is an open source framework for running YARN on Mesos.
• 'Myriad' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/41
66
4 Complementarity
References:
• Apache Mesos vs. Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs. MESOS: Can't We All Just Get Along?: http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity
• Data >> RAM: When processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
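The rule of thumb above can be written down directly. This is a hypothetical one-line policy for intuition; the function, the threshold, and the engine names are assumptions for illustration, not part of any product.

```python
def choose_engine(data_gb, cluster_ram_gb):
    # Data >> RAM: a stream-oriented engine with mature shuffling (Tez) may win.
    # Data fits in RAM: Spark can cache parsed data in memory and reuse it.
    return "tez" if data_gb > cluster_ram_gb else "spark"

# A 10 TB job on a 512 GB cluster vs. a 64 GB job on the same cluster:
plan = [choose_engine(10_000, 512), choose_engine(64, 512)]
```

Real systems weigh more factors (latency needs, cluster load, data layout), which is exactly the "smart execution engine" idea discussed on the next slide.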
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview, November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer)
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark. More integration is on the way.
4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS, the Hadoop Distributed File System. Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
  • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
  • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store):
  • https://spark.apache.org/docs/latest/storage-openstack-swift.html
  • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
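Being file-system agnostic shows up concretely in the path: the URI scheme of an input path selects the storage connector. A small sketch of that dispatch using only the standard library (the function and the example paths are illustrative assumptions):

```python
from urllib.parse import urlparse

def storage_scheme(path):
    """Return the storage connector implied by a path's URI scheme."""
    return urlparse(path).scheme or "file"   # no scheme -> local file system

backends = [storage_scheme(p) for p in (
    "hdfs://namenode:8020/logs",
    "s3n://bucket/logs",
    "tachyon://master:19998/data",
    "/tmp/local-data",
)]
```

Swapping storage backends is then a matter of changing the path prefix, not the processing code.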
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• ...
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC clusters:
  • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
  • Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise-ready: http://www.stratio.com
• Streaming-CEP-Engine: Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform, available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters, with the data and analytical tools that your data scientists need, in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives

          | Hadoop Ecosystem | Spark Ecosystem
----------|------------------|-------------------------
Component | HDFS             | Tachyon
          | YARN             | Mesos
Tools     | Pig              | Spark native API
          | Hive             | Spark SQL
          | Mahout           | MLlib
          | Storm            | Spark Streaming
          | Giraph           | GraphX
          | HUE              | Spark Notebook / ISpark
87
4 Complementarity +
bull Mesos and YARN can work together each for what it is especially good at rather than choosing one of the two for Spark deployment
bull Big data developers get the best of YARNrsquos power for Hadoop-driven workloads and Mesosrsquo ability to run any other kind of workload including non-Hadoop applications like Web applications and other long-running servicesrdquo
bull Project Myriad is an open source framework for running YARN on Mesos
bull lsquoMyriadrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag41
66
4 Complementarity +
References
bull Apache Mesos vs Apache Hadoop YARN httpswwwyoutubecomwatchv=YFC4-gtC19E
bull Myriad A Mesos framework for scaling a YARN
cluster httpsgithubcommesosmyriad
bull Myriad Project Marries YARN and Apache
Mesos Resource Management httpostaticcomblogmyriad-project-marries-yarn-and-apache-mesos-
resource-management
bull YARN vs MESOS Canrsquot We All Just Get
Along httpstrataconfcombig-data-conference-ca-
2015publicscheduledetail40620
67
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the
need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with
YARN (resource chaining in clusters)
bull Tez supports enterprise security
68
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5 Key Takeaways
1 File System: Spark is file system agnostic. Bring your own storage!
2 Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.
3 Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4 Alternatives: Do your due diligence based on your own use case, and research pros and cons, before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
4 Complementarity +
References
• Apache Mesos vs Apache Hadoop YARN: https://www.youtube.com/watch?v=YFC4-gtC19E
• Myriad: a Mesos framework for scaling a YARN cluster: https://github.com/mesos/myriad
• Myriad Project Marries YARN and Apache Mesos Resource Management: http://ostatic.com/blog/myriad-project-marries-yarn-and-apache-mesos-resource-management
• YARN vs MESOS: Can't We All Just Get Along? http://strataconf.com/big-data-conference-ca-2015/public/schedule/detail/40620
67
4 Complementarity +
• Spark on Tez for efficient ETL: https://github.com/hortonworks/spark-native-yarn
• Tez could take care of the pure Hadoop optimization strategies (building the DAG with knowledge of data distribution, statistics, or… HDFS caching).
• The Spark execution layer could be leveraged without the need for a nasty Spark/Hadoop coupling.
• Tez is good at fine-grained resource isolation with YARN (resource chaining in clusters).
• Tez supports enterprise security.
68
4 Complementarity +
• Data >> RAM: For processing huge data volumes, much bigger than cluster RAM, Tez might be better, since it is more "stream oriented", has a more mature shuffling implementation, and has closer YARN integration.
• Data << RAM: Since Spark can cache parsed data in memory, it can be much better when we process data smaller than the cluster's memory.
• Improving Spark for Data Pipelines with Native YARN Integration: http://hortonworks.com/blog/improving-spark-data-pipelines-native-yarn-integration
• Get the most out of Spark on YARN: https://www.youtube.com/watch?v=Vkx-TiQ_KDU
69
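The two rules of thumb above can be condensed into a toy heuristic (illustrative only; the function name and the 10x threshold are assumptions, not from either project):

```python
def suggest_engine(data_gb, cluster_ram_gb):
    """Toy heuristic for the Data >> RAM vs Data << RAM rule of thumb.

    Spark shines when the working set fits in (or near) cluster memory;
    a more stream-oriented engine such as Tez may cope better when the
    data volume dwarfs available RAM. Threshold chosen for illustration.
    """
    if data_gb <= cluster_ram_gb:
        return "spark"           # cache parsed data in memory
    if data_gb > 10 * cluster_ram_gb:
        return "tez"             # stream oriented, mature shuffling
    return "benchmark both"      # grey zone: measure on your own workload

print(suggest_engine(100, 512))    # spark
print(suggest_engine(50000, 512))  # tez
```

The grey-zone branch matters: between the two extremes, only a benchmark on your own workload settles the question.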
4 Complementarity
• Emergence of the 'Smart Execution Engine' layer: a Smart Execution Engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine (interview on November 13, 2014 with Matt Schumpert, Director of Product Management at Datameer).
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4 Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27th, 2015, at the Los Angeles Big Data Users Group:
http://files.meetup.com/12753252/LA Big Data Users Group Presentation Jan-27-2015.pdf
• New Syncsort Big Data Software Removes Major Barriers to Mainstream Apache Hadoop Adoption, February 12, 2015: http://www.itbusinessnet.com/article/New-Syncsort-Big-Data-Software-Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
• Syncsort Automates Data Migrations Across Multiple Platforms, February 23, 2015: http://www.itbusinessedge.com/blogs/it-unmasked/syncsort-automates-data-migrations-across-multiple-platforms.html
• Framework for the Future of Hadoop, March 9, 2015: http://blog.syncsort.com/2015/03/framework-future-hadoop
71
5 Key Takeaways
1 Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2 Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3 Integration: There is a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4 Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2 Use Spark to read and write data directly to a messaging system like Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3 Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4 Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
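In practice, Spark's storage agnosticism surfaces as nothing more than the URI scheme of the path you hand to the same API call. A small sketch of that idea (the `storage_backend` helper is hypothetical; the scheme names follow the options listed above):

```python
from urllib.parse import urlparse

# Backends named on this slide, keyed by the URI scheme used to reach them.
BACKENDS = {
    "hdfs": "Hadoop Distributed File System",
    "s3n": "Amazon S3",
    "maprfs": "MapR-FS",
    "tachyon": "Tachyon (in-memory)",
    "swift": "OpenStack Swift",
    "file": "local file system",
}

def storage_backend(path):
    """Pick the storage backend implied by a path's URI scheme."""
    scheme = urlparse(path).scheme or "file"   # no scheme means a local path
    return BACKENDS.get(scheme, "unknown")

print(storage_backend("hdfs://namenode:8020/logs"))  # Hadoop Distributed File System
print(storage_backend("s3n://bucket/logs"))          # Amazon S3
print(storage_backend("/tmp/logs"))                  # local file system
```

A Spark job that reads `s3n://bucket/logs` instead of `hdfs://namenode:8020/logs` needs no code change beyond the path string, which is the point of this slide.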
1 File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System: Intel Enterprise Edition for Lustre (IEEL), upcoming support: http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying clustering infrastructure, so alternative deployments are possible:
1 Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2 Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3 Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4 Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5 Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6 Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7 Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8 HPC Clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
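Each of these deployments is selected by the master URL handed to Spark at launch time, e.g. `local[*]`, `spark://host:7077` for standalone, `mesos://host:5050`, or the YARN modes of Spark 1.x. A small classifier sketch (the function itself is illustrative, not part of Spark):

```python
def deployment_mode(master):
    """Classify a Spark --master URL into the deployment it selects."""
    if master.startswith("local"):
        return "local"          # single JVM, e.g. local[*] uses all cores
    if master.startswith("spark://"):
        return "standalone"     # Spark's built-in cluster manager
    if master.startswith("mesos://"):
        return "mesos"
    if master.startswith("yarn"):
        return "yarn"           # yarn-client / yarn-cluster in Spark 1.x
    return "unknown"

print(deployment_mode("local[*]"))           # local
print(deployment_mode("spark://host:7077"))  # standalone
print(deployment_mode("mesos://host:5050"))  # mesos
```

The application code stays the same across all of these; only the master URL (and any cluster-specific configuration) changes.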
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3 Distributions
• Using Spark on a non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine: a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform, deployed at the world's largest telcos. September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution
• 'Guavus' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 Alternatives

              Hadoop Ecosystem   Spark Ecosystem
Components:   HDFS               Tachyon
              YARN               Mesos
Tools:        Pig                Spark native API
              Hive               Spark SQL
              Mahout             MLlib
              Storm              Spark Streaming
              Giraph             GraphX
              HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
• Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs Mesos

Criteria          YARN                      Mesos
Resource sharing  Yes                       Yes
Written in        Java                      C++
Scheduling        Memory only               CPU and Memory
Running tasks     Unix processes            Linux Container groups
Requests          Specific requests and     More generic, but more coding
                  locality preference       for writing frameworks
Maturity          Less mature               Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
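The conciseness being compared here is the functional, lambda-chaining style of the native API. A pure-Python word count written in that spirit (illustrative only, not actual Spark RDD code; the flatMap/map/reduceByKey labels mirror the RDD operations this style maps onto):

```python
from collections import Counter
from functools import reduce

lines = ["to be or not to be", "to spark or to hadoop"]

# flatMap -> map -> reduceByKey, expressed with plain Python building blocks.
words = (w for line in lines for w in line.split())        # flatMap
pairs = ((w, 1) for w in words)                            # map
counts = reduce(lambda acc, kv: acc + Counter({kv[0]: kv[1]}),
                pairs, Counter())                          # reduceByKey

print(counts["to"])  # 4
```

The same pipeline in Spark's Scala API, or with Java 8 lambdas, reads almost identically, which is the conciseness argument this slide is making.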
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
4 Complementarity +
bull Spark on Tez for efficient ETL httpsgithubcomhortonworksspark-native-yarn
bull Tez could takes care of the pure Hadoop optimization
strategies (building the DAG with knowledge of data
distribution statistics orhellip HDFS caching)
bull Spark execution layer could be leveraged without the
need of a nasty SparkHadoop coupling
bull Tez is good on fine-grained resource isolation with
YARN (resource chaining in clusters)
bull Tez supports enterprise security
68
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
4 Complementarity +
bull Data gtgt RAM Processing huge data volumes
much bigger than cluster RAM Tez might be better
since it is more ldquostream orientedrdquo has more mature
shuffling implementation closer YARN integration
bull Data ltlt RAM Since Spark can cache in memory
parsed data it can be much better when we process
data smaller than clusterrsquos memory
bull Improving Spark for Data Pipelines with Native
YARN Integration httphortonworkscomblogimproving-spark-data-
pipelines-native-yarn-integration
bull Get the most out of Spark on YARN httpswwwyoutubecomwatchv=Vkx-TiQ_KDU
69
4 Complementarity
bull Emergence of the lsquoSmart Execution Enginersquo Layer
Smart Execution Engine dynamically selects the optimal
compute framework at each step in the big data
analytics process based on the type of platform the
attributes of the data and the condition of the cluster
bull Matt Schumpert on Datameer Smart Execution Engine httpwwwinfoqcomarticlesdatameer-smart-execution-engine Interview on
November 13 2014 with Matt Schumpert Director of Product
Management at Datameer
bull The Challenge to Choosing the ldquoRightrdquo Execution
Engine By Peter Voss | September 30 2014 httpwwwdatameercomblogannouncementsthe-challenge-to-choosing-the-
right-execution-enginehtml
70
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1. File System
Spark does not require HDFS (the Hadoop Distributed File System). Your 'Big Data' use case might be implemented without HDFS. For example:
1. Use Spark to process data stored in the Cassandra File System (DataStax CassandraFS) or the MongoDB file system (GridFS).
2. Use Spark to read and write data directly to a messaging system such as Kafka, if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
• Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
• MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. OpenStack Swift (object store):
• https://spark.apache.org/docs/latest/storage-openstack-swift.html
• https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
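To make the S3 option above concrete, here is a minimal configuration sketch, assuming the s3n:// connector that shipped with Hadoop's AWS support at the time. The bucket name and both key values are placeholders; with IAM roles (e.g. on EMR) the credential lines are unnecessary.

```
# spark-defaults.conf sketch: credentials for reading s3n:// paths
# (placeholder values -- substitute your own keys and bucket)
spark.hadoop.fs.s3n.awsAccessKeyId      YOUR_ACCESS_KEY
spark.hadoop.fs.s3n.awsSecretAccessKey  YOUR_SECRET_KEY
```

With this in place, an application reads `s3n://my-bucket/logs/` exactly as it would an `hdfs://` path, which is the sense in which Spark is storage-agnostic.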
1. File System
When coupled with its analytics capabilities, file-system-agnostic Spark can only re-ignite the discussion of HDFS alternatives. See: Because Hadoop isn't perfect: 8 ways to replace HDFS, July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs/
A few HDFS alternatives to choose from:
• EMC ECS: Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage/
• Lustre file system: Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance/
• Quantcast QFS: https://www.quantcast.com/engineering/qfs/
• …
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
• Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
• Setting up Spark on the Brutus and Euler Clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
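In practice, the deployment modes above differ mainly in the master URL handed to spark-submit. A command-line sketch (host names, ports, and the application file are placeholder values, shown for Spark 1.x):

```
# Master URL forms for the cluster managers listed above (illustrative only)
spark-submit --master local[*]            app.py   # local mode, all cores
spark-submit --master spark://host:7077   app.py   # standalone cluster
spark-submit --master mesos://host:5050   app.py   # Apache Mesos
spark-submit --master yarn-cluster        app.py   # YARN, for Hadoop deployments
```

The application code itself does not change across these modes; only the master URL does.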
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution
79
Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon S3 (most commonly), Redshift, or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• Databricks Cloud: From raw data to insights and data products in an instant, March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra, Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector, Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It combines the power of Spark Streaming as a continuous computing framework with the Siddhi CEP engine for complex event processing: http://stratio.github.io/streaming-cep-engine/
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
• The BlueData (http://www.bluedata.com) EPIC software platform solves the infrastructure challenges and limitations that can slow down and stall Big Data deployments.
• With EPIC software, you can spin up Hadoop clusters – with the data and analytical tools that your data scientists need – in minutes rather than months: https://www.youtube.com/watch?v=SE1OP4ImrxU
• 'BlueData' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/37
85
• Guavus (http://www.guavus.com) embeds Apache Spark into its Operational Intelligence Platform Deployed at the World's Largest Telcos, September 25, 2014, by Eric Carr: http://databricks.com/blog/2014/09/25/guavus-embeds-apache-spark-into-its-operational-intelligence-platform-deployed-at-the-worlds-largest-telcos.html
• The Guavus operational intelligence platform analyzes streaming data and data at rest.
• The Guavus Reflex 2.0 platform is commercially compatible with open source Apache Spark: http://insidebigdata.com/2014/09/26/guavus-databricks-announce-reflex-platform-now-certified-spark-distribution/
• 'Guavus' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4. Alternatives

            Hadoop Ecosystem   Spark Ecosystem
Components  HDFS               Tachyon
            YARN               Mesos
Tools       Pig                Spark native API
            Hive               Spark SQL
            Mahout             MLlib
            Storm              Spark Streaming
            Giraph             GraphX
            HUE                Spark Notebook / ISpark
87
• Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce: http://tachyon-project.org
• Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change.
• Tachyon is the storage layer of the Berkeley Data Analytics Stack (BDAS): https://amplab.cs.berkeley.edu/software/
88
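The "without any code change" point comes down to the file-system URI. An illustrative pseudocode sketch (host names, ports, and paths are placeholders; 19998 is Tachyon's default master port):

```
# Same Spark program, different storage URI -- nothing else changes
rdd = sc.textFile("hdfs://namenode:9000/data/input")      # reading from HDFS
rdd = sc.textFile("tachyon://master:19998/data/input")    # reading from Tachyon
```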
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS": share a datacenter between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos

Criteria          YARN                         Mesos
Resource sharing  Yes                          Yes
Written in        Java                         C++
Scheduling        Memory only                  CPU and memory
Running tasks     Unix processes               Linux container groups
Requests          Specific requests and        More generic, but more coding
                  locality preference          for writing frameworks
Maturity          Less mature                  Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shells in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark – First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
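The native API's appeal is its functional, chainable style, which looks much the same in Scala, Java 8 (with lambdas), and Python. Spark itself isn't needed to see the shape of such a pipeline; plain Python stands in here as a sketch, with the data invented for illustration:

```python
# Plain-Python stand-in for an RDD-style pipeline (no Spark required):
# split lines into words, filter, and count by key, as flatMap/filter/
# reduceByKey would in Spark's native API.
lines = ["spark and hadoop", "spark without hadoop", "hadoop"]

# flatMap-like step: one word list from all lines
words = [w for line in lines for w in line.split()]

# filter-like step: keep only mentions of "spark"
spark_mentions = [w for w in words if w == "spark"]

# reduceByKey-like step: word counts
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

print(len(spark_mentions))  # 2
print(counts["hadoop"])     # 3
```

In Spark the same chain would run distributed over partitions; the program structure is what carries over.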
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
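That last point, mixing declarative SQL with imperative code, is the core of Spark SQL's pitch. A Spark cluster isn't assumed here, so Python's stdlib sqlite3 stands in purely to sketch the pattern (table, data, and threshold are invented for illustration; in Spark you would register a data set as a table and query it via the SQL context):

```python
import sqlite3

# Declarative part: define and query a table with SQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 7), ("ann", 2)])
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user").fetchall()

# Imperative part: post-process the SQL result in ordinary code
top = {user: total for user, total in rows if total > 4}
print(top)  # {'ann': 5, 'bob': 7}
```

The division of labor is the point: aggregation stays declarative, business logic stays in the host language, and Spark SQL lets both operate on the same data.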
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming

Criteria                     Storm                     Spark Streaming
Processing model             Record at a time          Mini-batches
Latency                      Sub-second                Few seconds
Fault tolerance (every       At least once (may be     Exactly once
record processed)            duplicates)
Batch framework integration  Not available             Core Spark API
Supported languages          Any programming language  Scala, Java, Python
95
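The processing-model row in the table above is the key architectural difference: Storm handles one record at a time, while Spark Streaming discretizes a stream into mini-batches and hands each batch to the ordinary Spark batch engine. A toy sketch of that discretization, batching by count rather than by time interval as Spark Streaming actually does:

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an unbounded record stream into fixed-size mini-batches,
    the way Spark Streaming turns a stream into a sequence of RDDs."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch  # each batch is then processed by the batch engine

records = ["r1", "r2", "r3", "r4", "r5"]
batches = list(micro_batches(records, 2))
print(batches)  # [['r1', 'r2'], ['r3', 'r4'], ['r5']]
```

Batching is what buys Spark Streaming its exactly-once semantics and core-API reuse, at the cost of the few-seconds latency shown in the table.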
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: you are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
4. Complementarity
• Emergence of the 'Smart Execution Engine' layer: a smart execution engine dynamically selects the optimal compute framework at each step in the big data analytics process, based on the type of platform, the attributes of the data, and the condition of the cluster.
• Matt Schumpert on the Datameer Smart Execution Engine: http://www.infoq.com/articles/datameer-smart-execution-engine – interview on November 13, 2014, with Matt Schumpert, Director of Product Management at Datameer.
• The Challenge to Choosing the "Right" Execution Engine, by Peter Voss, September 30, 2014: http://www.datameer.com/blog/announcements/the-challenge-to-choosing-the-right-execution-engine.html
70
4. Complementarity
• Operating in a Multi-execution Engine Hadoop Environment, by Erik Halseth of Datameer, January 27, 2015, at the Los Angeles Big Data Users Group: http://files.meetup.com/12753252/LA%20Big%20Data%20Users%20Group%20Presentation%20Jan-27-2015.pdf
4 Complementarity
bull Operating in a Multi-execution Engine Hadoop Environment by
Erik Halseth of Datameer on January 27th 2015 at the Los Angeles
Big Data Users Group
bull httpfilesmeetupcom12753252LA Big Data Users Group Presentation Jan-27-
2015pdf
bull New Syncsort Big Data Software Removes Major Barriers to
Mainstream Apache Hadoop Adoption February 12 2015
httpwwwitbusinessnetcomarticleNew-Syncsort-Big-Data-Software-
Removes-Major-Barriers-to-Mainstream-Apache-Hadoop-Adoption-3749366
bull Syncsort Automates Data Migrations Across Multiple Platforms
February 23 2015
httpwwwitbusinessedgecomblogsit-unmaskedsyncsort-automates-data-
migrations-across-multiple-platformshtml
bull Framework for the Future of Hadoop March 9 2015
httpblogsyncsortcom201503framework-future-hadoop
71
5 Key Takeaways1 Evolution of compute models is still ongoing
Watch out Apache Flink project for true low-latency and iterative use cases
2 Transition Tools from the Hadoop ecosystem are still being ported to Spark Keep watching general availability and balance risk and opportunity
3 Integration Healthy dose of Hadoop ecosystem integration with Spark More integration is on the way
4 Complementarity Components and tools from Hadoop ecosystem and Spark ecosystem can work together each for what it is especially good at One size doesnrsquot fit all
72
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
73
1 File System
Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs. Mesos

Criteria           YARN                          Mesos
Resource sharing   Yes                           Yes
Written in         Java                          C++
Scheduling         Memory only                   CPU and memory
Running tasks      Unix processes                Linux container groups
Requests           Specific requests and         More generic, but more coding
                   locality preference           for writing frameworks
Maturity           Less mature                   Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python.
• Interactive shell in Scala and Python.
• Spark supports Java 8 lambda expressions, making code much more concise, nearly as simple as the Scala API.
• "ETL with Spark" - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
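The conciseness of the functional RDD style can be illustrated without a cluster. The sketch below writes the classic word count in the same shape as the Spark chain, with the corresponding PySpark calls shown in comments; the Python itself runs standalone:

```python
# Word count in RDD style. The Spark equivalent would be:
#   sc.textFile(path).flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
lines = ["spark or hadoop", "spark with hadoop"]

words = [w for line in lines for w in line.split()]  # flatMap: line -> words
pairs = [(w, 1) for w in words]                      # map: word -> (word, 1)
counts = {}
for w, n in pairs:                                   # reduceByKey: sum per word
    counts[w] = counts.get(w, 0) + n
```

The whole pipeline is three small transformations, which is the point the slide makes about the Scala/Python APIs and Java 8 lambdas.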
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
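The "mix and match SQL with imperative code" pattern can be sketched with the standard library's sqlite3 module, which stands in for the SQL engine purely for illustration (Spark SQL's own API differs; table and column names here are invented):

```python
import sqlite3

# A tiny in-memory table standing in for a Spark SQL registered table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 5), ("ann", 2)])

# Declarative step: aggregate with SQL.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative step: post-process the result set in ordinary Python.
heavy_users = [u for u, total in rows if total >= 5]
```

In Spark SQL the result of the declarative step would itself be a distributed dataset, so the imperative step also runs on the cluster.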
Spark MLlib
• 'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
93
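MLlib ships distributed implementations of algorithms such as k-means. The core assignment-and-update step that MLlib parallelizes across a cluster can be sketched in plain Python (illustration only, not the MLlib API; the sample points are invented):

```python
# One k-means iteration: assign each point to its nearest center,
# then recompute each center as the mean of its members.
points = [(0.0, 0.0), (0.5, 0.0), (9.0, 9.0), (9.5, 9.5)]
centers = [(0.0, 0.0), (9.0, 9.0)]

def nearest(p, centers):
    # Index of the center with the smallest squared distance to p.
    return min(range(len(centers)),
               key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)

assign = [nearest(p, centers) for p in points]
new_centers = []
for i in range(len(centers)):
    members = [p for p, a in zip(points, assign) if a == i]
    new_centers.append((sum(x for x, _ in members) / len(members),
                        sum(y for _, y in members) / len(members)))
```

MLlib's contribution is running the assignment step as a distributed map and the mean computation as a distributed reduce over RDD partitions.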
Spark Streaming
• 'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
94
Storm vs. Spark Streaming

Criteria                 Storm                      Spark Streaming
Processing model         Record at a time           Mini batches
Latency                  Sub-second                 Few seconds
Fault tolerance          At least once              Exactly once
(every record            (may be duplicates)
processed)
Batch framework          Not available              Core Spark API
integration
Supported languages      Any programming language   Scala, Java, Python
95
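The processing-model row can be made concrete in plain Python: a record-at-a-time engine handles each event on arrival, while a micro-batch engine first chops the stream into small batches and processes each batch as one unit. This is a conceptual sketch, not either framework's API:

```python
stream = [1, 2, 3, 4, 5, 6, 7]

# Record-at-a-time (Storm style): one invocation per record, lowest latency.
per_record = [x * 10 for x in stream]

# Micro-batch (Spark Streaming style): group records into fixed-size batches
# (here the "batch interval" is 3 records) and process each batch as a unit.
def micro_batches(stream, size):
    return [stream[i:i + size] for i in range(0, len(stream), size)]

batches = micro_batches(stream, 3)
per_batch = [[x * 10 for x in batch] for batch in batches]
```

Both models compute the same result; the trade-off in the table is latency (per record) versus batch-engine integration and exactly-once bookkeeping (per batch).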
GraphX
• 'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
96
Notebook
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
97
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage!
2. Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
5. Key Takeaways
1. Evolution: The evolution of compute models is still ongoing. Watch the Apache Flink project for true low-latency and iterative use cases.
2. Transition: Tools from the Hadoop ecosystem are still being ported to Spark. Keep watching general availability, and balance risk and opportunity.
3. Integration: There is already a healthy dose of Hadoop ecosystem integration with Spark, and more integration is on the way.
4. Complementarity: Components and tools from the Hadoop ecosystem and the Spark ecosystem can work together, each for what it is especially good at. One size doesn't fit all.
72
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
73
1. File System
Spark does not require HDFS (the Hadoop Distributed File System); your 'Big Data' use case might be implemented without it. For example:
1. Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS).
2. Use Spark to read and write data directly to a messaging system like Kafka if your use case doesn't need data persistence. Example: http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
3. Use an in-memory distributed file system such as Spark's cousin Tachyon: http://sparkbigdata.com/component/tags/tag/13
4. Use a non-HDFS file system already supported by Spark:
   • Amazon S3: http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter2/s3.html
   • MapR-FS: https://www.mapr.com/blog/comparing-mapr-fs-and-hdfs-nfs-and-snapshots
5. Use OpenStack Swift (object store):
   • https://spark.apache.org/docs/latest/storage-openstack-swift.html
   • https://www.openstack.org/summit/openstack-paris-summit-2014/session-videos/presentation/the-perfect-match-apache-spark-meets-swift
74
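Spark's storage agnosticism comes down to URI schemes: the same read call works whether the path starts with hdfs://, s3n://, tachyon://, or is a plain local path. A minimal sketch of scheme-based dispatch in plain Python (illustrative only; the scheme strings mirror Spark-era conventions, and the backend functions are invented):

```python
from urllib.parse import urlparse

# Hypothetical backend readers, one per storage scheme. In Spark, the
# Hadoop FileSystem abstraction plays this role behind sc.textFile(path).
BACKENDS = {
    "hdfs":    lambda p: f"read {p} from HDFS",
    "s3n":     lambda p: f"read {p} from Amazon S3",
    "tachyon": lambda p: f"read {p} from Tachyon",
    "file":    lambda p: f"read {p} from the local file system",
}

def text_file(uri):
    # Paths without a scheme fall back to the local file system.
    scheme = urlparse(uri).scheme or "file"
    return BACKENDS[scheme](uri)

r1 = text_file("s3n://bucket/logs")
r2 = text_file("/tmp/local.txt")
```

Swapping HDFS for S3, Tachyon, or CassandraFS changes only the URI prefix (and the connector on the classpath), not the application code, which is the point of this slide.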
1. File System
File-system agnostic Spark, coupled with its analytics capabilities, can only re-ignite the discussion of HDFS alternatives. See "Because Hadoop isn't perfect: 8 ways to replace HDFS", July 11, 2012: https://gigaom.com/2012/07/11/because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from:
• Apache Spark on Mesos, running on CoreOS and using EMC ECS HDFS storage, March 9, 2015: http://www.recorditblog.com/post/apache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
• Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (upcoming support): http://insidebigdata.com/2014/10/02/replacing-hdfs-lustre-maximum-performance
• Quantcast QFS: https://www.quantcast.com/engineering/qfs
• …
75
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
76
2. Deployment
While Spark is most often discussed as a replacement for MapReduce in Hadoop clusters, to be deployed on YARN, Spark is actually agnostic to the underlying cluster infrastructure, so alternative deployments are possible:
1. Local: http://sparkbigdata.com/tutorials/51-deployment/121-local
2. Standalone: http://sparkbigdata.com/tutorials/51-deployment/123-standalone
3. Apache Mesos: http://sparkbigdata.com/tutorials/51-deployment/122-mesos
4. Amazon EC2: http://sparkbigdata.com/tutorials/51-deployment/124-amazon-ec2
5. Amazon EMR: http://sparkbigdata.com/tutorials/51-deployment/127-amazon-emr
6. Rackspace: http://sparkbigdata.com/tutorials/51-deployment/138-on-rackspace
7. Google Cloud Platform: http://sparkbigdata.com/tutorials/51-deployment/139-google-cloud
8. HPC clusters:
   • Setting up Spark on top of Sun/Oracle Grid Engine (PSI): http://sparkbigdata.com/tutorials/51-deployment/126-sun-oracle-grid-engine-sge
   • Setting up Spark on the Brutus and Euler clusters (ETH): http://sparkbigdata.com/tutorials/51-deployment/128-hpc-cluster
77
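In practice the deployment choice surfaces in Spark as the master URL passed at startup. The URL forms below follow Spark's documented conventions of this era (local[*], spark://, mesos://, yarn-client/yarn-cluster); the classifier function itself is a hypothetical helper for illustration:

```python
def deployment_mode(master):
    # Classify a Spark master URL the way spark-submit interprets it.
    if master.startswith("local"):
        return "local"          # e.g. local, local[4], local[*]
    if master.startswith("spark://"):
        return "standalone"     # e.g. spark://host:7077
    if master.startswith("mesos://"):
        return "mesos"          # e.g. mesos://host:5050
    if master in ("yarn-client", "yarn-cluster"):
        return "yarn"           # YARN modes as of Spark 1.x
    raise ValueError(f"unrecognized master URL: {master}")

modes = [deployment_mode(m) for m in
         ("local[*]", "spark://host:7077", "mesos://host:5050", "yarn-client")]
```

The application code is identical across all four; only the master URL (and cluster-side setup) changes, which is what "infrastructure agnostic" means here.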
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
78
3. Distributions
• Using Spark on a non-Hadoop distribution:
79
Databricks Cloud
• Databricks Cloud is not dependent on Hadoop. It gets its data from Amazon's S3 (most commonly), Redshift or Elastic MapReduce: https://databricks.com/product/databricks-cloud
• "Databricks Cloud: From raw data to insights and data products in an instant", March 4, 2015: https://databricks.com/blog/2015/03/04/databricks-cloud-from-raw-data-to-insights-and-data-products-in-an-instant.html
• Databricks Cloud announcement and demo at Spark Summit 2014, July 2, 2014: https://www.youtube.com/watch?v=dJQ5lV5Tldw
80
DSE
• DSE (DataStax Enterprise), built on Apache Cassandra, presents itself as a non-Hadoop Big Data platform. Data can be stored in the Cassandra File System: http://www.datastax.com/documentation/datastax_enterprise/4.6/datastax_enterprise/spark/sparkTOC.html
• "Escape from Hadoop: Ultra Fast Data Analysis with Spark & Cassandra", Piotr Kolaczkowski, September 26, 2014: http://www.slideshare.net/PiotrKolaczkowski/fast-data-analysis-with-spark-4
• "Escape from Hadoop with Apache Spark and Cassandra with the Spark Cassandra Connector", Helena Edelson, November 24, 2014: http://www.slideshare.net/helenaedelson/escape-from-hadoop-with-apache-spark-and-cassandra-41950082
81
• Stratio is a Big Data platform based on Spark. It is 100% open source and enterprise ready: http://www.stratio.com
• Streaming-CEP-Engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and the Siddhi CEP engine as a complex event processing engine: http://stratio.github.io/streaming-cep-engine
• 'Stratio' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/40
82
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
1 File System
Spark does not require HDFS Hadoop Distributed File System Your lsquoBig Datarsquo use case might be implemented without HDFS For example
1 Use Spark to process data stored in Cassandra File System (DataStax CassandraFS) or MongoDB File System (GridFS)
2 Use Spark to read and write data directly to a messaging system like Kafka if your use case doesnrsquot need data persistence Example httptechblognetflixcom201503can-spark-streaming-survive-chaos-monkeyhtml
3 Use an In-Memory distributed File System such as Sparkrsquos cousin Tachyon httpsparkbigdatacomcomponenttagstag13
4 Use a Non-HDFS file systemrsquo already supported by Spark bull Amazon S3
bull httpdatabricksgitbooksiodatabricks-spark-reference-applicationscontentlogs_analyzerchapter2s3html
bull MapR-FSbull httpswwwmaprcomblogcomparing-mapr-fs-and-hdfs-nfs-and-snapshots
5 OpenStack Swift (Object Store)bull httpssparkapacheorgdocslateststorage-openstack-swifthtml
bull httpswwwopenstackorgsummitopenstack-paris-summit-2014session-videospresentationthe-perfect-match-apache-spark-meets-swift
74
1 File System
When coupled with its analytics capabilities file-system agnostic Spark can only re-ignite this discussion of HDFS alternatives Because Hadoop isnrsquot perfect 8 ways to replace HDFS July 11 2012 httpsgigaomcom20120711because-hadoop-isnt-perfect-8-ways-to-replace-hdfs
A few HDFS alternatives to choose from include bull Apache Spark on Mesos running on CoreOS and using EMC ECS
HDFS storage March 9 2015httpwwwrecorditblogcompostapache-spark-on-mesos-running-on-coreos-and-using-emc-ecs-hdfs-storage
bull Lustre File System - Intel Enterprise Edition for Lustre (IEEL) (Upcoming support)httpinsidebigdatacom20141002replacing-hdfs-lustre-maximum-performance
bull Quantcast QFS httpswwwquantcastcomengineeringqfs
bull hellip
75
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster-computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos

Criteria           YARN                                        Mesos
Resource sharing   Yes                                         Yes
Written in         Java                                        C++
Scheduling         Memory only                                 CPU and memory
Running tasks      Unix processes                              Linux container groups
Requests           Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity           Less mature                                 Relatively more mature
90
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, making Java code nearly as concise as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
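The concise, lambda-driven style of the native API can be illustrated without a cluster. The sketch below is plain Python, not Spark: `lines` is an invented stand-in for a distributed dataset, and the comments map each step to the RDD operation it imitates.

```python
from functools import reduce

# Stand-in input; in real Spark this would come from a distributed source
lines = ["spark and hadoop", "spark without hadoop"]

def merge(acc, kv):
    # reduceByKey stand-in: fold (word, 1) pairs into a dict of counts
    acc[kv[0]] = acc.get(kv[0], 0) + kv[1]
    return acc

words = (w for line in lines for w in line.split())   # flatMap
pairs = map(lambda w: (w, 1), words)                  # map with a lambda
counts = reduce(merge, pairs, {})                     # reduceByKey

print(counts)  # {'spark': 2, 'and': 1, 'hadoop': 2, 'without': 1}
```

The same chain reads almost identically in Scala or in Java 8 with lambdas, which is the point the slide makes about conciseness.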
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark. https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
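The "mix SQL and imperative code" idea can be sketched without a Spark cluster. Here the stdlib sqlite3 module stands in for Spark SQL's engine (the `events` table and its values are invented for illustration); Spark SQL applies the same pattern at cluster scale over distributed data.

```python
import sqlite3

# In-memory database standing in for a SQL context
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, amount REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("a", 10.0), ("b", 5.0), ("a", 7.5)])

# Declarative step: aggregate with SQL ...
rows = conn.execute(
    "SELECT user, SUM(amount) AS total FROM events GROUP BY user ORDER BY user"
).fetchall()

# ... imperative step: post-process the result in ordinary code,
# the mix-and-match pattern the slide describes
top = max(rows, key=lambda r: r[1])
print(top)  # ('a', 17.5)
```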
Spark MLlib
93
'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming

Criteria                                   Storm                                 Spark Streaming
Processing model                           Record at a time                      Mini batches
Latency                                    Sub-second                            Few seconds
Fault tolerance (every record processed)   At least once (possible duplicates)   Exactly once
Batch framework integration                Not available                         Core Spark API
Supported languages                        Any programming language              Scala, Java, Python
95
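The processing-model row of the table can be sketched in plain Python (no Storm or Spark involved): record-at-a-time handles each event as it arrives, while micro-batching groups the stream into small batches and processes each batch as a unit. Spark Streaming batches by time interval; the sketch batches by size for simplicity.

```python
from itertools import islice

def micro_batches(source, batch_size):
    """Group a stream into fixed-size mini batches, imitating Spark
    Streaming's micro-batch model (size-based instead of time-based)."""
    while True:
        batch = list(islice(source, batch_size))
        if not batch:
            return
        yield batch

stream = iter(range(10))  # stand-in for an unbounded event stream

# Record-at-a-time (Storm model): one event per invocation.
# Mini batches (Spark Streaming model): one small list per interval.
batches = list(micro_batches(stream, 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The latency trade-off in the table follows directly: a batch cannot be processed before it is full (or its interval elapses), hence seconds rather than sub-second latency.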
GraphX
96
'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup or even JavaScript in a collaborative manner. https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython. https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
76
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
2 Deployment
While Spark is most often discussed as a replacement for MapReducein Hadoop clusters to be deployed on YARN Spark is actuallyagnostic to the underlying infrastructure for clustering soalternative deployments are possible
1 Local httpsparkbigdatacomtutorials51-deployment121-local
2 Standalone httpsparkbigdatacomtutorials51-deployment123-standalone
3 Apache Mesos httpsparkbigdatacomtutorials51-deployment122-mesos
4 Amazon EC2 httpsparkbigdatacomtutorials51-deployment124-amazon-ec2
5 Amazon EMR httpsparkbigdatacomtutorials51-deployment127-amazon-emr
6 Rackspace httpsparkbigdatacomtutorials51-deployment138-on-rackspace
7 Google Cloud Platformhttpsparkbigdatacomtutorials51-deployment139-google-cloud
8 HPC Clusters
bull Setting up Spark on top of SunOracle Grid Engine (PSI) -httpsparkbigdatacomtutorials51-deployment126-sun-oracle-grid-engine-sge
bull Setting up Spark on the Brutus and Euler Clusters (ETH) -httpsparkbigdatacomtutorials51-deployment128-hpc-cluster
77
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
6 Key Takeaways
78
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
3 Distributions
bull Using Spark on a Non-Hadoop distribution
79
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
Cloud
bull Databricks Cloud is not dependent on
Hadoop It gets its data from Amazonrsquos S3
(most commonly) Redshift Elastic MapReducehttpsdatabrickscomproductdatabricks-cloud
bull Databricks Cloud From raw data to insights and
data products in an instant March 4 2015
httpsdatabrickscomblog20150304databricks-cloud-from-raw-data-to-
insights-and-data-products-in-an-instanthtml
bull Databricks Cloud Announcement and Demo at
Spark Summit 2014 July 2 2014
httpswwwyoutubecomwatchv=dJQ5lV5Tldw
80
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
DSE
bull DSE DataStax Enterprise built on Apache Cassandra
presents itself as a Non-Hadoop Big Data Platform
Data can be stored in Cassandra File System httpwwwdatastaxcomdocumentationdatastax_enterprise46datastax_enter
prisesparksparkTOChtml
bull Escape from Hadoop Ultra Fast Data Analysis with
Spark amp Cassandra Piotr Kolaczkowski September 26 2014
httpwwwslidesharenetPiotrKolaczkowskifast-data-analysis-with-spark-4
bull Escape from Hadoop with Apache Spark and
Cassandra with the Spark Cassandra Connector
Helena Edelson published on November 24 2014
httpwwwslidesharenethelenaedelsonescape-from-hadoop-with-apache-
spark-and-cassandra-41950082
81
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
bull Stratio is a Big Data platform based on Spark Itis 100 open source and enterprise readyhttpwwwstratiocom
bull Streaming-CEP-Engine Streaming CEP engineis a Complex Event Processing platform builton Spark Streaming It is the result of combiningthe power of Spark Streaming as a continuouscomputing framework and Siddhi CEP engine ascomplex event processing enginehttpstratiogithubiostreaming-cep-engine
bull lsquoStratiorsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag40
82
83
bull xPatterns (httpatigeocomtechnology) is a complete big
data analytics platform available with a novel
architecture that integrates components across
three logical layers Infrastructure Analytics
and Applications
bull xPatterns is cloud-based exceedingly scalable
and readily interfaces with existing IT systems
bull lsquoxPatternsrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of the idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as Data Center "OS":
  • Share a datacenter between multiple cluster computing apps; provide new abstractions and services.
  • Mesosphere DCOS: datacenter services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS…
• 'Mesos' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
YARN vs. Mesos

Criteria          YARN                      Mesos
Resource sharing  Yes                       Yes
Written in        Java                      C++
Scheduling        Memory only               CPU and memory
Running tasks     Unix processes            Linux container groups
Requests          Specific requests and     More generic, but more coding
                  locality preference       for writing frameworks
Maturity          Less mature               Relatively more mature
90
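The "Requests" row above contrasts YARN's request-based model with Mesos's offer-based model. The difference can be caricatured in a few lines of plain Python — a toy sketch only, not the real YARN or Mesos APIs; all names and structures here are made up for illustration:

```python
# Toy contrast of request-based (YARN-like) vs. offer-based (Mesos-like)
# resource allocation. Illustrative only; not the real APIs.

def yarn_style(cluster, requests):
    """Request-based: the framework asks the resource manager for
    specific containers (here, memory sizes) and gets what fits."""
    granted = []
    for mem in requests:
        if cluster["memory"] >= mem:
            cluster["memory"] -= mem
            granted.append(mem)
    return granted

def mesos_style(cluster, wants_mem):
    """Offer-based: the master offers slices of available resources and
    the framework decides which offers to accept -- more generic, but
    the framework author writes more scheduling logic."""
    offers = [{"memory": cluster["memory"] // 2},
              {"memory": cluster["memory"] // 2}]
    return [o for o in offers if o["memory"] >= wants_mem]

print(yarn_style({"memory": 8}, [2, 4]))   # frameworks ask, the RM grants
print(mesos_style({"memory": 8}, 3))       # the master offers, the framework picks
```

The toy captures why the table calls Mesos "more generic but more coding": the accept/decline decision, trivial here, is the framework's responsibility.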
Spark Native API
• Spark native API in Scala, Java, and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions, for much more concise code that is nearly as simple as the Scala API
• ETL with Spark - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
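The conciseness the slide credits to lambda support comes from chaining small functions over collections. The shape of the Spark API can be mimicked with plain Python built-ins — a stdlib sketch, not actual PySpark (real code would go through a SparkContext and RDDs):

```python
from collections import Counter

# Word count in the flatMap / map / reduceByKey shape of the Spark API,
# expressed with plain Python for illustration.
lines = ["spark and hadoop", "spark without hadoop"]

words = [w for line in lines for w in line.split()]   # flatMap: line -> words
pairs = map(lambda w: (w, 1), words)                  # map: word -> (word, 1)
counts = Counter()                                    # reduceByKey: sum counts
for w, n in pairs:
    counts[w] += n

print(counts["spark"])  # 2
```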
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql/
• Spark SQL provides SQL performance and maintains compatibility with Hive. It supports all existing Hive data formats, user-defined functions (UDFs), and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data, as well as ingesting data from sources that provide schema, such as JSON, Parquet, Hive, or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
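The "mix and match SQL with imperative APIs" idea can be sketched with Python's stdlib sqlite3 as a stand-in engine. Spark SQL's actual API is different; this only shows the pattern of interleaving a declarative query with ordinary host-language code:

```python
import sqlite3

# Stand-in for a SQL-on-data engine: declare the relational work in SQL,
# then post-process the result imperatively in the host language.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, clicks INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", 3), ("bob", 5), ("ann", 4)])

# SQL side: aggregation.
rows = conn.execute(
    "SELECT user, SUM(clicks) FROM events GROUP BY user ORDER BY user"
).fetchall()

# Imperative side: arbitrary logic over the query result.
top = max(rows, key=lambda r: r[1])
print(top)  # ('ann', 7)
```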
Spark MLlib
93
• 'Spark MLlib' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
• 'Spark Streaming' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs. Spark Streaming

Criteria                  Storm                     Spark Streaming
Processing model          Record at a time          Mini-batches
Latency                   Sub-second                Few seconds
Fault tolerance           At least once             Exactly once
(every record processed)  (may be duplicates)
Batch framework           Not available             Core Spark API
integration
Supported languages       Any programming language  Scala, Java, Python
95
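The record-at-a-time vs. mini-batch distinction in the table can be illustrated with a few lines of plain Python. This is a toy sketch: real Spark Streaming discretizes a live stream into RDDs on a clock-driven batch interval, simplified here to a fixed record count.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group a (possibly unbounded) record stream into mini-batches,
    the way Spark Streaming discretizes its input (simplified to a
    fixed count instead of a time interval)."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

records = ["a", "b", "c", "d", "e"]

# Record-at-a-time (Storm-style): one invocation per record -> sub-second latency.
per_record = [r.upper() for r in records]

# Mini-batch (Spark Streaming-style): one invocation per batch -> latency of
# roughly one batch interval, but batch-oriented code shared with core Spark.
per_batch = [[r.upper() for r in batch] for batch in micro_batches(records, 2)]

print(per_batch)  # [['A', 'B'], ['C', 'D'], ['E']]
```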
GraphX
96
• 'GraphX' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics. It has built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, Markup, or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark-shell backend for IPython: https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file system agnostic. Bring your own storage.
2. Deployment: Spark is cluster infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
SlimBaltagi
http://www.slideshare.net/sbaltagi
83
• xPatterns (http://atigeo.com/technology) is a complete big data analytics platform available with a novel architecture that integrates components across three logical layers: Infrastructure, Analytics, and Applications.
• xPatterns is cloud-based, exceedingly scalable, and readily interfaces with existing IT systems.
• 'xPatterns' Tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/39
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
84
bull The BlueData (httpwwwbluedatacom) EPIC software
platform solves the infrastructure challenges and
limitations that can slow down and stall Big Data
deployments
bull With EPIC software you can spin up Hadoop
clusters ndash with the data and analytical tools that
your data scientists need ndash in minutes rather than
months httpswwwyoutubecomwatchv=SE1OP4ImrxU
bull lsquoBlueDatarsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag37
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
85
bull Guavus (httpwwwguavuscom) embeds Apache Spark into
its Operational Intelligence Platform Deployed at the
Worldrsquos Largest Telcos September 25 2014 by Eric Carr httpdatabrickscomblog20140925guavus-embeds-apache-spark-into-its-
operational-intelligence-platform-deployed-at-the-worlds-largest-telcoshtml
bull Guavus operational intelligence platform analyzes
streaming data and data at rest
bull The Guavus Reflex 20 platform is commercially
compatible with open source Apache Sparkhttpinsidebigdatacom20140926guavus-databricks-announce-reflex-
platform-now-certified-spark-distribution
bull lsquoGuavusrsquo Tag at SparkBigDatacom httpsparkbigdatacomcomponenttagstag38
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
IV Spark without Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
86
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
4 AlternativesHadoop Ecosystem Spark Ecosystem
Component
HDFS Tachyon
YARN Mesos
Tools
Pig Spark native API
Hive Spark SQL
Mahout MLlib
Storm Spark Streaming
Giraph GraphX
HUE Spark NotebookISpark
87
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
Requests Specific requests
and locality
preference
More generic but more
coding for writing
frameworks
Maturity Less mature Relatively more mature
90
Spark Native API
bull Spark Native API in Scala Java and Python
bull Interactive shell in Scala and Python
bull Spark supports Java 8 for a much more concise
Lambda expressions to get code nearly as
simple as the Scala API
bull ETL with Spark - First Spark London Meetup May 28 2014
httpwwwslidesharenetrafalkwasnyetl-with-spark-first-spark-london-
meetup
bull lsquoSpark Corersquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag
11-core-spark
91
Spark SQL
bull Spark SQL is a new SQL engine designed from
ground-up for Spark httpssparkapacheorgsql
bull Spark SQL provides SQL performance and maintains
compatibility with Hive It supports all existing Hive data
formats user-defined functions (UDF) and the Hive
metastore
bull Spark SQL also allows manipulating (semi-) structured
data as well as ingesting data from sources that
provide schema such as JSON Parquet Hive or
EDWs It unifies SQL and sophisticated analysis
allowing users to mix and match SQL and more
imperative programming APIs for advanced analytics
92
Spark MLlib
93
lsquoSpark MLlib rsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponenttagstag5-mllib
Spark Streaming
94
lsquoSpark Streaming rsquo Tag at httpsparkbigdatacomcomponenttagstag3-
spark-streaming
Storm vs Spark StreamingCriteria
Processing Model Record at a time Mini batches
Latency Sub second Few seconds
Fault tolerancendash
every record
processed
At least one ( may
be duplicates)
Exactly one
Batch Framework
integration
Not available Core Spark API
Supported
languages
Any programming
language
Scala Java
Python
95
GraphX
96
lsquoGraphXrsquo Tag at
SparkBigDatacomhttpsparkbigdatacomcomponent
tagstag6-graphx
Notebook
97
bull Zeppelin httpzeppelin-projectorg is a web-based
notebook that enables interactive data analytics
Has built-in Apache Spark support
bull Spark Notebook is an interactive web-based
editor that can combine Scala code SQL
queries Markup or even JavaScript in a
collaborative manner httpsgithubcomandypetrellaspark-
notebook
bull ISpark is an Apache Spark-shell backend for
IPython httpsgithubcomtribbloidISpark
IV Spark on Non-Hadoop
1 File System
2 Deployment
3 Distributions
4 Alternatives
5 Key Takeaways
98
6 Key Takeaways1 File System Spark is File System AgnosticBring Your Own Storage
2 Deployment Spark is Cluster InfrastructureAgnostic Choose your deployment
3 Distributions You are no longer tied to Hadoopfor Big Data processing Spark distributions asservice in the cloud or imbedded in Non-Hadoopdistributions are emerging
4 Alternatives Do your due diligence based onyour own use case and research pros and consbefore picking a specific tool or switching from onetool to another
99
IV More QampA
100
httpwwwSparkBigDatacom
sbaltagigmailcom
httpswwwlinkedincominslimbal
tagi
SlimBaltagi
httpwwwslidesharenetsbaltagi
bull Tachyon is a memory-centric distributed file
system enabling reliable file sharing at memory-
speed across cluster frameworks such as Spark
and MapReduce httpshttptachyon-projectorg
bull Tachyon is Hadoop compatible Existing Spark
and MapReduce programs can run on top of it
without any code change
bull Tachyon is the storage layer of the Berkeley
Data Analytics Stack (BDAS) httpsamplabcsberkeleyedusoftware
88
bull Mesos (httpmesosapacheorg) enables fine
grained sharing which allows a Spark job to dynamically
take advantage of the idle resources in the cluster during
its execution This leads to considerable performance
improvements especially for long running Spark jobs
bull Mesos as Data Center ldquoOSrdquo
bull Share datacenter between multiple cluster computing
apps Provide new abstractions and services
bull Mesosphere DCOS Datacenter services including
Apache Spark Apache Cassandra Apache YARN
Apache HDFShellip
bull lsquoMesosrsquo Tag at SparkBigDatacomhttpsparkbigdatacomcomponenttagstag16-mesos
89
YARN vs MesosCriteria
Resource
sharing
Yes Yes
Written in Java C++
Scheduling Memory only CPU and Memory
Running tasks Unix processes Linux Container groups
• Mesos (http://mesos.apache.org) enables fine-grained sharing, which allows a Spark job to dynamically take advantage of idle resources in the cluster during its execution. This leads to considerable performance improvements, especially for long-running Spark jobs.
• Mesos as data center "OS": share the data center between multiple cluster computing apps; provide new abstractions and services.
• Mesosphere DCOS: data center services including Apache Spark, Apache Cassandra, Apache YARN, Apache HDFS...
• 'Mesos' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/16-mesos
89
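As a deployment sketch of the above (the host, port, class and jar names below are placeholders, not from the talk), pointing a Spark application at a Mesos master is essentially a one-flag change to spark-submit:

```shell
# Submit the same application jar to a Mesos master instead of
# YARN or the standalone manager; only --master changes.
spark-submit \
  --master mesos://mesos-master.example.com:5050 \
  --class com.example.MyApp \
  my-app-assembly.jar
```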
YARN vs Mesos

Criteria          YARN                                        Mesos
Resource sharing  Yes                                         Yes
Written in        Java                                        C++
Scheduling        Memory only                                 CPU and memory
Running tasks     Unix processes                              Linux container groups
Requests          Specific requests and locality preference   More generic, but more coding for writing frameworks
Maturity          Less mature                                 Relatively more mature

90
Spark Native API
• Spark native API in Scala, Java and Python
• Interactive shell in Scala and Python
• Spark supports Java 8 lambda expressions for much more concise code, nearly as simple as the Scala API
• "ETL with Spark" - First Spark London Meetup, May 28, 2014: http://www.slideshare.net/rafalkwasny/etl-with-spark-first-spark-london-meetup
• 'Spark Core' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/11-core-spark
91
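To illustrate the Java 8 point above, here is a minimal word count in the native Scala API (the input path is a placeholder); the Java 8 lambda equivalent is shown as a comment so the block stays in one language:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local sketch: a two-thread local master so the snippet is self-contained.
val sc = new SparkContext(new SparkConf().setAppName("wordcount").setMaster("local[2]"))

val counts = sc.textFile("input.txt")            // placeholder path
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Java 8 lambdas get nearly as close (Spark 1.x Java API):
//   JavaPairRDD<String, Integer> counts = jsc.textFile("input.txt")
//       .flatMap(line -> Arrays.asList(line.split(" ")))
//       .mapToPair(word -> new Tuple2<>(word, 1))
//       .reduceByKey((a, b) -> a + b);
```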
Spark SQL
• Spark SQL is a new SQL engine designed from the ground up for Spark: https://spark.apache.org/sql
• Spark SQL provides fast SQL performance and maintains compatibility with Hive: it supports all existing Hive data formats, user-defined functions (UDFs) and the Hive metastore.
• Spark SQL also allows manipulating (semi-)structured data as well as ingesting data from sources that provide a schema, such as JSON, Parquet, Hive or EDWs. It unifies SQL and sophisticated analysis, allowing users to mix and match SQL and more imperative programming APIs for advanced analytics.
92
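A minimal sketch of that mix-and-match style, using the Spark 1.x SQLContext API (the file name and field names are illustrative, and a live SparkContext sc is assumed):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Ingest self-describing JSON; the schema is inferred automatically.
val people = sqlContext.jsonFile("people.json")  // placeholder file
people.registerTempTable("people")

// Declarative SQL...
val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")

// ...mixed with the imperative API on the same result.
val greetings = adults.map(row => "Name: " + row(0)).collect()
```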
Spark MLlib
93
'Spark MLlib' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/5-mllib
Spark Streaming
94
'Spark Streaming' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/3-spark-streaming
Storm vs Spark Streaming

Criteria                                   Storm                               Spark Streaming
Processing model                           Record at a time                    Mini-batches
Latency                                    Sub-second                          Few seconds
Fault tolerance (every record processed)   At least once (may be duplicates)   Exactly once
Batch framework integration                Not available                       Core Spark API
Supported languages                        Any programming language            Scala, Java, Python

95
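The mini-batch model in the table can be seen directly in the API: records arriving within each batch interval are processed together as one small job. A minimal sketch (host and port are placeholders, and a live SparkContext sc is assumed):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Group incoming records into 2-second mini-batches; each batch is
// processed as a small RDD job, which is what gives Spark Streaming
// its exactly-once semantics and Core Spark API integration.
val ssc = new StreamingContext(sc, Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999)

lines.flatMap(_.split(" "))
     .map((_, 1))
     .reduceByKey(_ + _)
     .print()                // print each batch's word counts

ssc.start()
ssc.awaitTermination()
```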
GraphX
96
'GraphX' tag at SparkBigData.com: http://sparkbigdata.com/component/tags/tag/6-graphx
Notebook
97
• Zeppelin (http://zeppelin-project.org) is a web-based notebook that enables interactive data analytics, with built-in Apache Spark support.
• Spark Notebook is an interactive web-based editor that can combine Scala code, SQL queries, markup or even JavaScript in a collaborative manner: https://github.com/andypetrella/spark-notebook
• ISpark is an Apache Spark shell backend for IPython: https://github.com/tribbloid/ISpark
IV. Spark without Hadoop
1. File System
2. Deployment
3. Distributions
4. Alternatives
5. Key Takeaways
98
5. Key Takeaways
1. File System: Spark is file-system agnostic. Bring your own storage.
2. Deployment: Spark is cluster-infrastructure agnostic. Choose your deployment.
3. Distributions: You are no longer tied to Hadoop for Big Data processing. Spark distributions as a service in the cloud, or embedded in non-Hadoop distributions, are emerging.
4. Alternatives: Do your due diligence based on your own use case, and research pros and cons before picking a specific tool or switching from one tool to another.
99
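Takeaway 1 in practice: the same textFile call reads from any supported storage backend, and only the URI scheme changes (paths, hosts and the bucket name below are placeholders, and a live SparkContext sc is assumed):

```scala
// Same code, different storage: only the URI scheme differs.
val local = sc.textFile("file:///tmp/data.txt")        // local file system
val hdfs  = sc.textFile("hdfs://namenode:8020/data")   // HDFS
val s3    = sc.textFile("s3n://my-bucket/data")        // Amazon S3 (s3n scheme, 2015-era)
```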
V. More Q&A
100
http://www.SparkBigData.com
sbaltagi@gmail.com
https://www.linkedin.com/in/slimbaltagi
@SlimBaltagi
http://www.slideshare.net/sbaltagi