Big Data for Everyone

Twitter: #bd4e


SXSW 2011 Panel Presentation - "Big Data for Everyone (No Data Scientists Required)"

Transcript of Final deck

Page 1: Final deck

Big Data for Everyone

Twitter: #bd4e

Page 2: Final deck

Introduction to Big Data

Steve Watt, Hadoop Strategy

@wattsteve #bd4e

Page 3: Final deck


What is “Big Data”?

“Every two days we create as much information as we did from the dawn of civilization up until 2003” – Eric Schmidt, Google

Current state of affairs:
- Explosion of user-generated content
- Storage is really cheap, so we can store what we want
- Traditional data stores have reached critical mass

Issues:
- Enterprise amnesia
- Traditional architectures become brittle and slow when tasked with processing data at petabyte scale
- How do we process unstructured data?

Page 4: Final deck


How were these issues addressed?

2004 – Google publishes its seminal whitepapers on Map/Reduce and the Google File System, describing a new programming paradigm for processing data at Internet scale

The whitepapers describe the use of Massive Parallelism to allow a system to scale horizontally, achieving linear performance improvements

This approach is well suited to a cloud model, whereby additional instances can be commissioned or decommissioned with an immediate effect on performance.

The approaches described in the Google white papers were incorporated into the open source Apache Hadoop project.

Page 5: Final deck


What is Apache Hadoop?

It is a cluster technology with a single master and multiple slaves, designed for commodity hardware

It consists of two runtimes: the Hadoop Distributed File System (HDFS) and Map/Reduce

As data is copied onto HDFS, it is split into blocks and replicated to other machines (nodes) to provide redundancy

Self-contained jobs are written in Map/Reduce and submitted to the cluster. The jobs run in parallel on each of the machines in the cluster, processing the data on the local machine (data locality).

Hadoop may execute or re-execute a job on any node in the cluster.

Node failures are automatically handled by the framework.
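To make the Map/Reduce model concrete, here is a minimal word-count sketch written for Hadoop Streaming in Ruby (the deck's scripting language of choice via Wukong); the file names and HDFS paths are illustrative, not taken from the slides.

#!/usr/bin/env ruby
# wc_map.rb -- mapper: emit "<word><TAB>1" for every word read from stdin
STDIN.each_line do |line|
  line.split.each { |word| puts "#{word}\t1" }
end

#!/usr/bin/env ruby
# wc_reduce.rb -- reducer: lines arrive sorted by word, so sum runs of equal keys
current, count = nil, 0
STDIN.each_line do |line|
  word, n = line.chomp.split("\t")
  if word != current
    puts "#{current}\t#{count}" if current
    current, count = word, 0
  end
  count += n.to_i
end
puts "#{current}\t#{count}" if current

Submitted through the Hadoop Streaming jar (its path varies by distribution), the framework runs the mapper on the nodes that hold the input blocks and re-runs tasks if a node fails:

hadoop jar hadoop-streaming.jar \
  -input /data/text -output /data/wordcounts \
  -mapper wc_map.rb -reducer wc_reduce.rb \
  -file wc_map.rb -file wc_reduce.rb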

Page 6: Final deck


The Big Data Ecosystem

(Diagram: user roles and system layers mapped to tools)

- Provisioning: ClusterChef / Apache Whirr / EC2
- Offline Systems (Analytics): Hadoop, running on commodity hardware
- Scripting: Pig / Wukong / Cascading
- DBA: Hive / Karmasphere
- Non-Programmer (Human Consumption, Visualizations): BigSheets / Datameer
- Import/Export Tooling: Nutch / Sqoop / Flume
- Online Systems (OLTP @ Scale): NoSQL (Cassandra / HBase)

Page 7: Final deck

Offline customer scenario

Eric Sammer, Solution Architect

@esammer #bd4e

Page 8: Final deck

Use Case: Product Recommendations

“We can provide a better experience (and make more money) if we provide meaningful product recommendations.”

We need data:

- What products did a user buy?

- What products did a user browse, hover over, rate, add to cart (but not buy) in the last 2 months?

- What are the attributes of the user? (e.g. income, gender, friends)

- What are our margins on products, inventory, upcoming promotions?

Page 9: Final deck

Problems

That’s a lot of data! (2 months of activity + all purchase data + all user data)
- Activity: ~20 GB per day x ~60 days = ~1.2 TB
- User data: ~2 GB
- Purchase data: ~5 GB
- Misc: inventory, product costs, promotion schedules

Distilling data to aggregates would reduce fidelity.

Easy to see how looking at more data could improve recommendations.

How do we keep this information current?

Page 10: Final deck

The Answer

Calculate all qualifying products once a day for each user and store them for quick display

Use Hadoop to process data in parallel on hundreds of machines
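A minimal sketch of what that daily job could look like with Hadoop Streaming and Ruby; the log format (user_id, product_id, action as tab-separated fields), file names, and two-stage layout are assumptions for illustration, not the panelists' actual pipeline. Stage one groups activity by user and emits co-viewed product pairs; a second stage shaped like the word-count reducer above then sums the pairs into "users who looked at X also looked at Y" counts, which are refreshed once a day and stored for quick display.

#!/usr/bin/env ruby
# pairs_map.rb -- pass through (user, product) from each activity-log line
STDIN.each_line do |line|
  user, product, _action = line.chomp.split("\t")
  puts "#{user}\t#{product}" if user && product
end

#!/usr/bin/env ruby
# pairs_reduce.rb -- input is sorted by user_id; emit each co-viewed pair once per user
def flush(products)
  products.uniq.sort.combination(2) { |a, b| puts "#{a},#{b}\t1" }
end

current_user, products = nil, []
STDIN.each_line do |line|
  user, product = line.chomp.split("\t")
  if user != current_user
    flush(products) unless products.empty?
    current_user, products = user, []
  end
  products << product
end
flush(products) unless products.empty?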

Page 11: Final deck
Page 12: Final deck

Online customer scenario

Matt Pfeil, CEO

@mattz62 #bd4e

Page 13: Final deck


What is Apache Cassandra?

Page 14: Final deck

Use Case: Managing Email

“My email volume is growing exponentially. Traditional solutions – including using a SAN – simply can’t keep up. I need to scale horizontally and get incredibly fast real time performance.”

The Problem

How do we achieve scalability, redundancy, high performance?

How do we store billions of files on commodity hardware?

How do we increase capacity by simply adding machines? (No SANs!)

How do we make it FAST?

Page 15: Final deck

Requirements

Storage for email:
- Billions of emails (<100 KB avg)
- 2M users, 100 MB of storage each = ~190 TB
- Growth of 50% every 6 months
- Durable

Requirements for the storage system:
- No master / single point of failure
- Linear scalability + redundancy
- Multiple active data centers
- Many reads, many writes
- Millisecond response times
- Commodity hardware
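The "no master, just add machines" requirements are typically met with consistent hashing on a token ring, which is the placement idea behind Cassandra. A toy Ruby sketch of that idea follows; the node names and key format are made up for illustration.

#!/usr/bin/env ruby
# Toy consistent-hash ring: each key hashes to a 128-bit token and is owned by the
# node whose token is its nearest clockwise successor. Adding a node takes over
# only part of one neighbor's range, so capacity grows machine by machine with
# no master coordinating placement.
require 'digest/md5'

RING_SIZE = 2**128

def token(s)
  Digest::MD5.hexdigest(s).to_i(16)
end

def node_for(key, nodes)
  t = token(key)
  nodes.min_by { |node| (token(node) - t) % RING_SIZE }
end

nodes = %w[mail-node-1 mail-node-2 mail-node-3]
puts node_for("user42/inbox/msg-0001", nodes)

Redundancy falls out of the same picture: replicas are written to the next nodes around the ring, so losing one machine never loses the only copy.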

Page 16: Final deck

Solution

800 TB of Storage

~1.75 Million reads or writes/sec (No Cache!)

130 Machines

Read/Write at both Data Centers

No “Master” Data Center

Page 17: Final deck

Where to next? The Adjacent Possible

Flip Kromer, CTO

@mrflip #bd4e

Page 18: Final deck

Something about Anything

Page 19: Final deck

Everything about Something

Page 20: Final deck

Bigger than One Computer

Page 21: Final deck

Bigger than Frontal Lobe

Page 22: Final deck

Bigger than Excel

Page 23: Final deck

what’s coming to help

Page 24: Final deck

myth of the “data base”

Hold your data

Ask questions

Page 25: Final deck

Managing & Shipping

Hadoop FTW

Cassandra, HBase, ElasticSearch, ...

Integration is still too hard

Dev Ops

Reliable Decoupling: Flume, Graphite

Page 26: Final deck

Data flutters by (label)

Elephants make sturdy piles ({GROUP})

Number becomes thought (process_group)

Hadoop

Page 27: Final deck

class TwStP < Streamer
  def process line
    a = JSON.load(line) rescue {}
    yield a.values_at(*a.keys.sort)
  end
end
Wukong.run(TwStP)

Twitter Parser in a Tweet
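For readers without Wukong installed, the same flatten-a-tweet-to-TSV idea can be sketched in plain Ruby over stdin; the script and file names here are hypothetical, and the output field order (keys sorted alphabetically) simply mirrors the tweet-sized version above.

#!/usr/bin/env ruby
# flatten_tweets.rb -- read one JSON tweet per line, print its values as a
# tab-separated row with fields ordered by sorted key name
require 'json'

STDIN.each_line do |line|
  a = JSON.load(line) rescue {}
  puts a.values_at(*a.keys.sort).join("\t")
end

Example run: cat tweets.json | ruby flatten_tweets.rb > tweets.tsv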

Page 28: Final deck

pure functionality

Page 29: Final deck

pure functionality

Page 30: Final deck

Cassandra

HBase

ElasticSearch

MySQL

Redis

TokyoTyrant

SimpleDB

MongoDB

sqlite

whisper (graphite)

file system

S3

Data Stores in Production

Page 31: Final deck

Cassandra

HBase

ElasticSearch

MySQL

Redis

TokyoTyrant

SimpleDB

MongoDB

sqlite

whisper (graphite)

file system

S3

Dev Ops: Rethink Hard

Page 32: Final deck

Still Blind

Visual Grammar to see it: NYTimes, Stamen, Ben Fry

Interactive tools: Tableau, Spotfire, bloom.io, d3.js, Gephi

Page 33: Final deck

Human-Scale Tools

Data-as-a-Service: Infochimps, SimpleGeo, Drawn to Scale

Business Intelligence: Familiar Paradigm, New Scale

BigSheets, Datameer

Page 34: Final deck

Panel Discussion

Stu Hood, Software Engineer

@stuhood #bd4e

Page 35: Final deck

Thanks for coming!

Stu Hood @stuhood
Flip Kromer @mrflip
Matt Pfeil @mattz62
Eric Sammer @esammer
Steve Watt @wattsteve