Clear story _spark_

clearstorydata.com

Using Spark and Shark for Fast Cycle Analysis on Diverse Data

12.2.13

Vaibhav Nivargi

About ClearStory Data

Analysis in the New Data Landscape

New use cases seen in all industries.

• Live situational analysis requiring fast-cycle analysis across internal data and sources of external data

• Multi-source analysis with data refreshing on new insights, as data from sources evolves

• Large-scale analysis of structured and unstructured data combined in integrated insights

Example: Interactive Multi-source Analysis

More data and more people change the analysis.

• Facebook: Shares, Likes, Comments

• News Coverage: Online, Print, Television

• Twitter: Followers, Tweets, Retweets

• Donations: New Members, Donations

• Website Traffic: Traffic, Referrals, Content

• Corporate Sponsors: Corporate Engagement, New Inquiries

Data Intelligence: Interactive analysis on diverse internal & external data

Today’s Need is Speed, Scale & Ad Hoc Flexibility

With more sources, more data and more people.

Why Spark and Shark?

• RDDs

– Low latency & scale

– Iterative and interactive computation

• Lineage and fault tolerance

– Able to re-derive data

• Expressive power of Scala and SQL

– Operations beyond aggregations, joins, and statistical operators

– Advanced: ML, data mining, segmentation, approximate queries, graphs …

• Support for structured and semi-structured data

• BDAS Stack & AMPLab

– Tachyon, MLBase, BlinkDB, GraphX …

• Community and adoption
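
The "lineage and fault tolerance" point can be illustrated with a toy model. The sketch below is plain Python, not Spark code: `ToyDataset` and its methods are hypothetical names invented for this illustration. It only shows the idea that a dataset which remembers its parent and its transformation can be re-derived after its cached copy is lost.

```python
from typing import Callable, List, Optional


class ToyDataset:
    """Toy stand-in for an RDD (hypothetical, not the Spark API).

    Each dataset keeps its lineage: the parent it came from and the
    function that derives its rows from the parent's rows.
    """

    def __init__(self,
                 source: Optional[List[int]] = None,
                 parent: Optional["ToyDataset"] = None,
                 transform: Optional[Callable[[List[int]], List[int]]] = None):
        self._source = source
        self._parent = parent
        self._transform = transform
        self._cached: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        return ToyDataset(parent=self,
                          transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable[[int], bool]) -> "ToyDataset":
        return ToyDataset(parent=self,
                          transform=lambda rows: [r for r in rows if pred(r)])

    def cache(self) -> "ToyDataset":
        # Materialize the rows once and keep them in memory.
        self._cached = self.collect()
        return self

    def evict(self) -> None:
        # Simulate losing the in-memory copy (for example, a failed node).
        self._cached = None

    def collect(self) -> List[int]:
        if self._cached is not None:
            return list(self._cached)
        if self._parent is None:
            return list(self._source)
        # No cached copy: re-derive the rows by replaying the lineage.
        return self._transform(self._parent.collect())


base = ToyDataset(source=[1, 2, 3, 4, 5, 6])
even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()

before_loss = even_squares.collect()
even_squares.evict()
after_loss = even_squares.collect()  # recomputed from lineage
```

Here `before_loss` and `after_loss` hold the same rows, because the second call replays the filter and map steps instead of reading the cache.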

The ClearStory Solution

[Diagram: Data Sources, ClearStory Platform, ClearStory Application. Labels: Data Inference & Profiling, Harmonization, In-Memory Data Units, Visualization, Collaboration]

Where do Spark & Shark fit?

[Diagram labels: User Application; ClearStory API; Harmonization Engine and Blended Data Processing; Spark Cluster + ClearStory IP; Data Access, Inference and Lineage; Data Source API; data sources: Public, Premium, Web, RDBMS, Hadoop, Files]

How we leverage Spark & Shark

• User intent captured and translated to custom API

• Harmonization-as-a-Service

• Manages Spark and Shark query execution

• Read cached data from HDFS

• RESTful

• Merges datasets (RDDs) on the fly – on user request

• Support conversion of user actions to backend queries

• Query optimizations

• Performance optimizations

• Mixed-mode execution (sql2rdd & Spark native)

• Caching

• Pre-computation
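
As a rough picture of "merges datasets on the fly", the sketch below joins two keyed datasets in plain Python. It is not ClearStory's harmonization logic and not Spark code: `join_by_key`, the dataset names, and the sample rows are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def join_by_key(left: List[Tuple[Hashable, int]],
                right: List[Tuple[Hashable, int]]) -> List[Tuple[Hashable, Tuple[int, int]]]:
    """Inner-join two lists of (key, value) pairs on their keys."""
    # Index the right side by key so each left row is matched in one lookup.
    right_index: Dict[Hashable, List[int]] = defaultdict(list)
    for key, value in right:
        right_index[key].append(value)

    return [(key, (left_value, right_value))
            for key, left_value in left
            for right_value in right_index.get(key, [])]


# Illustrative rows only: (date, count) pairs from two hypothetical sources.
website_traffic = [("2013-12-01", 1200), ("2013-12-02", 950)]
donations = [("2013-12-01", 40), ("2013-12-03", 15)]

blended = join_by_key(website_traffic, donations)
```

Only the date present in both sources survives the inner join, so `blended` has a single row pairing that day's traffic with that day's donations.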

How we leverage Spark & Shark

• Query results returned to the application for scalable visualization and ClearStory-specific viz techniques

• RDDs cached/un-cached and materialized at strategic points based on usage patterns and signals

• Data updates automatically processed as source data changes

• ClearStory’s own deployment, packaging, and integrated monitoring for operations at scale
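
The slides do not say how caching decisions are made. As one possible reading of "cached/un-cached ... based on usage patterns", the sketch below counts accesses per dataset and keeps only the most-used ones in a fixed number of cache slots. `UsageBasedCachePolicy`, `capacity`, and `min_hits` are hypothetical names, and the policy itself is an assumption, not ClearStory's method.

```python
from collections import Counter
from typing import Set


class UsageBasedCachePolicy:
    """Decides which dataset ids to keep cached, based on access counts."""

    def __init__(self, capacity: int, min_hits: int):
        self.capacity = capacity      # maximum number of cached datasets
        self.min_hits = min_hits      # accesses needed before caching is considered
        self.hits: Counter = Counter()
        self.cached: Set[str] = set()

    def record_access(self, dataset_id: str) -> None:
        self.hits[dataset_id] += 1
        if dataset_id not in self.cached and self.hits[dataset_id] >= self.min_hits:
            self._admit(dataset_id)

    def _admit(self, dataset_id: str) -> None:
        if self.capacity <= 0:
            return
        if len(self.cached) >= self.capacity:
            # Cache is full: find the least-used cached dataset.
            coldest = min(self.cached, key=lambda d: self.hits[d])
            if self.hits[coldest] >= self.hits[dataset_id]:
                return  # the newcomer is not hotter than anything cached
            self.cached.discard(coldest)  # un-cache the coldest dataset
        self.cached.add(dataset_id)


policy = UsageBasedCachePolicy(capacity=1, min_hits=2)
for dataset_id in ["a", "a", "b", "b"]:
    policy.record_access(dataset_id)
cached_after_tie = set(policy.cached)    # "a" stays: "b" is not yet hotter

policy.record_access("b")
cached_after_shift = set(policy.cached)  # "b" replaces "a" once it is used more
```

A real deployment would also weigh dataset size and recomputation cost; the counter here stands in for whatever usage signals are available.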

Spark Developments – What We Like

• Query cancellation, progress indication (0.8.1 and beyond)

• More performance breakthroughs

• Workload Management

• BlinkDB

• MLBase

• Tachyon

• GraphX

We’re Hiring!

• Working with the community, giving back

• Lots of exciting new developments

• This is like the early days of Hadoop – massive momentum gathering

The First Spark Summit!

More Meet-ups!
