How Companies are Using Spark And where the Edge in Big Data will be Matei Zaharia.

How Companies areUsing SparkAnd where the Edge in Big Data will be

Matei Zaharia

HistoryDecreasing storage costs have led to an explosion of big data

Commodity cluster software, like Hadoop, has made it 10-20x cheaper to store large datasets

Broadly available from multiple vendors

ImplicationBig data storage is becoming commoditized, so how will organizations get an edge?

What matters now is what you can do with the data.

Two FactorsSpeed: how quickly can you go from data to decisions?

Sophistication: can you run the best algorithms on the data?

These factors have usually required separate,non-commodity tools

Apache SparkA compute engine for Hadoop data that is:

Fast: up to 100x faster than MapReduce

100 90

Spark (disk)

Spark (RAM)

SQL performance

Sophisticated: can run today’smost advanced algorithms

Apache Spark

Spark Streami

ngreal-time

SharkSQL

MLlibmachine learning

GraphXgraph

Fully open source: one of mostactive projects in big data

Giraph StormTez

Contributors in past year

Giraph StormTez

Contributors in past year

Fully open source: one of mostactive projects in big data

Spark brings top-end data analysis tocommodity Hadoop clusters

Spark Use Cases

1. Yahoo! PersonalizationYahoo! properties are highly personalized to maximize relevance

Reaction must be fast, as stories, etc change in time

Best algorithms are highly sophisticated

1. Yahoo! PersonalizationExample challenge: relevance of news stories

Relevance models must be updated throughout the day

1. Yahoo! PersonalizationSpark at Yahoo!

» Runs in Hadoop YARN to use existing data & clusters

Result: pilot for stream ads

» 120 lines in Scala, compared to 15K in C++

» 30 min to run on 100 million samples

Major contributor on YARN

support, scalability, operations

Hadoop:Batch

Processing

Spark:Iterative

Processing…

YARN:Resource Manager

Storage:HDFS, HBase, etc

2. Yahoo! Ad AnalyticsYahoo! Ads wanted interactive BI on terabytes of data

Chose Shark (Hive on Spark) to provide this through standard Hive server API + Tableau

Result: interactive-speed querieson terabytes from Tableau

Major contributor on columnar compression,

statistics, JDBC

Historical DW (HDFS)

Hadoop MR(Pig, Hive, MR)

Large Hadoop Cluster

SatelliteShark Cluster

3. Conviva Real-Time Video OptimizationConviva manages 4+ billion video streams per month

Dynamically selects sources to optimize quality

Time is critical: 1 second buffering = lost viewers

3. Conviva Real-Time Video OptimizationUsing Spark Streaming, Conviva learns network conditions in real time

Results fed directly to video players to optimize streams

System running in production

Decision Maker

Storage Layer

4. ClearStory Data: Multi-source, Fast-cycle Analysis Same-day results from data updating at disparate sources

Dozens of disparate sources converged in seconds/minutes

clearstorydata.com

Data Sources ClearStory PlatformClearStory Application

Data Inference & Profiling

Harmonization

Visualization

Collaboration

In-MemoryData Units

4. ClearStory Data: Multi-source, Fast-cycle Analysis

Get StartedDownload and resources: spark.incubator.apache.org

Free video tutorials: spark-summit.org/2013

Commercial support:

ConclusionBig data will be standard: everyone will have it

Organizations will gain an edge through speed of action and sophistication of analysis

Apache Spark brings these to Hadoop clusters

How Companies are Using Spark And where the Edge in Big Data will be Matei Zaharia.

Documents

Transcript of How Companies are Using Spark And where the Edge in Big Data will be Matei Zaharia.

Trends for Big Data and Apache Spark in 2017 by Matei Zaharia

In#Memory*ClusterComputing*for · Spark In#Memory*ClusterComputing*for Iterative*and*Interactive*Applications* Matei*Zaharia,*Mosharaf*Chowdhury,Tathagata Das,* Ankur*Dave,*Justin*Ma,*Murphy*McCauley,*

High-Speed In-Memory Analytics over Hadoop and …poloclub.gatech.edu/cse6242/2017spring/slides/CSE6242-12...Slides adopted from Matei Zaharia (MIT) and Oliver Vagner (NCR) Spark &

Advanced Big Data Systems - GitHub Pagesstg-tud.github.io/ctbd/2017/CTBD_09_spark.pdf · 2019-06-18 · Spark •Project start - UC Berkeley, 2009 •Matei Zaharia et al. Spark: Cluster

Matei Zaharia Spark Community Update. An Exciting Year for Spark May 2013May 2014 developers contributing 60200 companies contributing 1750 total lines.

CSE 6242 / CX 4242 Data and Visual Analytics | Georgia ... · Slides adopted from Matei Zaharia (Stanford) and Oliver Vagner (NCR) Spark & Spark SQL High-Speed In-Memory Analytics

Advanced(Spark Features( - UC Berkeley AMP Campampcamp.berkeley.edu/.../2012/06/...advanced-spark.pdf · Matei&Zaharia& & UC&Berkeley& & && Advanced(Spark Features(UC&BERKELEY&

Spark In-Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy.

Spark Streaming Large-scale near-real-time stream processing Tathagata Das (TD) along with Matei Zaharia, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion.

UC Berkeley a Spark in the cloud iterative and interactive cluster computing Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

MapReduce & Spark - Paris Dauphine Universitycolazzo/BD/partie2.1.pdf · • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia,

Advanced Big Data Systems · Spark •Project start - UC Berkeley, 2009 •Matei Zaharia et al. Spark: Cluster Computing with Working Sets,. HotCloud 2010. •Matei Zaharia et al.

Making Big Data Processing Simple with Spark · Making Big Data Processing Simple with Spark Matei Zaharia December 17, 2015 . ... • High-level APIs in Java, Scala, Python, R ...

New Directions for Spark in 2015 - Spark Summit · PDF fileNew Directions for Spark in 2015 Matei Zaharia March 18, 2015 . ... Daytona GraySort benchmark, sortbenchmark.org ... {JSON}

Sparrow Distributed Low-Latency Spark Scheduling Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica.

Apache Spark: The New Language Of Analytics · 2018-11-04 · 11 / 19 Spark Is A Data Scientist’s Dream Invented by Canadian computer scientist Matei Zaharia, Apache Spark’s a

Zaharia spark-scala-days-2012

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia

Matei Zaharia , Dhruba Borthakur * , Joydeep Sen Sarma * ,

UC Berkeley Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

In#MemoryClusterComputingfor · Spark In#MemoryClusterComputingfor IterativeandInteractiveApplications MateiZaharia,MosharafChowdhury,Tathagata Das, AnkurDave,JustinMa,MurphyMcCauley,