Big Data for Everyone
Twitter: #bd4e
Introduction to Big Data
Steve Watt, Hadoop Strategy
@wattsteve #bd4e
What is “Big Data”?
“Every two days we create as much information as we did from the dawn of civilization up until 2003” – Eric Schmidt, Google
Current state of affairs:
- Explosion of user-generated content
- Storage is really cheap, so we can store what we want
- Traditional data stores have reached critical mass

Issues:
- Enterprise Amnesia
- Traditional architectures become brittle and slow when tasked with processing data at petabyte scale
- How do we process unstructured data?
How were these issues addressed?
2004 – Google publishes seminal whitepapers on Map/Reduce and the Google File System, a new programming paradigm to process data at Internet Scale
The whitepapers describe the use of Massive Parallelism to allow a system to scale horizontally, achieving linear performance improvements
This approach is well suited to a cloud model, whereby additional instances can be commissioned or decommissioned with an immediate effect on performance.
The approaches described in the Google white papers were incorporated into the open source Apache Hadoop project.
What is Apache Hadoop ?
It is a cluster technology with a single master and multiple slaves, designed for commodity hardware
It consists of two runtimes, the Hadoop distributed file system (HDFS) and Map/Reduce
As data is copied onto HDFS, it is split into blocks and replicated to other machines (nodes) to provide redundancy
Self-contained jobs are written in Map/Reduce and submitted to the cluster. The jobs run in parallel on the machines in the cluster, each processing the data stored on its local machine (data locality).
Hadoop may execute or re-execute a job on any node in the cluster.
Node failures are automatically handled by the framework.
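The map, shuffle, and reduce phases described above can be sketched in plain Ruby. This is an illustrative local simulation of a word count (in-memory input, no Hadoop cluster involved); a real job runs the same phases in parallel across machines:

```ruby
# Local simulation of the Map/Reduce flow Hadoop runs across a cluster.
# Input here is an in-memory array standing in for files on HDFS.
lines = ["big data for everyone", "big data at scale"]

# Map phase: each input line is turned into (word, 1) pairs.
pairs = lines.flat_map { |line| line.split.map { |word| [word, 1] } }

# Shuffle: the framework groups the pairs by key between map and reduce.
grouped = pairs.group_by(&:first)

# Reduce phase: sum the counts for each word.
counts = grouped.map { |word, ps| [word, ps.map(&:last).sum] }.to_h
```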
The Big Data Ecosystem
- Provisioning: ClusterChef / Apache Whirr / EC2
- Offline Systems (Analytics): Hadoop
- Online Systems (OLTP @ Scale): NoSQL (Cassandra / HBase)
- Import/Export Tooling: Nutch / SQOOP / Flume
- Scripting: Pig / WuKong / Cascading
- DBA: Hive / Karmasphere
- Non-Programmer: BigSheets / DataMeer
- Human Consumption: Visualizations
- Running on: Commodity Hardware
Offline customer scenario
Eric Sammer, Solution Architect
@esammer #bd4e
Use Case: Product Recommendations
“We can provide a better experience (and make more money) if we provide meaningful product recommendations.”
We need data:
- What products did a user buy?
- What products did a user browse, hover over, rate, add to cart (but not buy) in the last 2 months?
- What are the attributes of the user? (e.g. income, gender, friends)
- What are our margins on products, inventory, upcoming promotions?
Problems
That’s a lot of data! (2 months of activity + all purchase data + all user data)
- Activity: ~20 GB per day x ~60 days = ~1.2 TB
- User data: ~2 GB
- Purchase data: ~5 GB
- Misc: inventory, product costs, promotion schedules
Distilling data to aggregates would reduce fidelity.
Easy to see how looking at more data could improve recommendations.
How do we keep this information current?
The Answer
Calculate all qualifying products once a day for each user and store them for quick display
Use Hadoop to process data in parallel on hundreds of machines
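As a toy illustration of that daily batch computation, here is a "users who bought X also bought Y" count in the Map/Reduce style. This is not the actual pipeline from the talk; the purchase records and names are made up:

```ruby
# Hypothetical sketch: precompute co-purchase counts from purchase
# records, the kind of daily batch job the slides describe.
purchases = [
  ["alice", "book"], ["alice", "lamp"],
  ["bob",   "book"], ["bob",   "lamp"], ["bob", "mug"]
]

# Map: group purchases by user, then emit every co-purchased pair.
by_user = purchases.group_by(&:first)
pairs = by_user.values.flat_map do |rows|
  rows.map(&:last).combination(2).to_a
end

# Reduce: count each pair; the top-scoring pairs become the
# recommendations stored for quick display.
counts = pairs.group_by(&:itself).transform_values(&:size)
```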
Online customer scenario
Matt Pfeil, CEO
@mattz62 #bd4e
What is Apache Cassandra?
Use Case: Managing Email
“My email volume is growing exponentially. Traditional solutions – including using a SAN – simply can’t keep up. I need to scale horizontally and get incredibly fast real time performance.”
The Problem
How do we achieve scalability, redundancy, high performance?
How do we store billions of files on commodity hardware?
How do we increase capacity by simply adding machines? (No SANs!)
How do we make it FAST?
Requirements
Storage for Email:
- Billions of emails (<100 KB average)
- 2M users x 100 MB of storage each = ~190 TB
- Growth of 50% every 6 months
- Durable
Requirements for the storage system:
- No master / single point of failure
- Linear scalability + redundancy
- Multiple active data centers
- Many reads, many writes
- Millisecond response times
- Commodity hardware
Solution
- 800 TB of storage
- ~1.75 million reads or writes/sec (no cache!)
- 130 machines
- Read/write at both data centers
- No “Master” data center
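One reason "just add machines" works is consistent hashing, the partitioning scheme Cassandra uses to spread keys across a masterless cluster: each node owns an arc of a hash ring, and a key's hash decides which node stores it. A toy Ruby sketch (node names and the key are illustrative):

```ruby
require "digest"

# Toy consistent-hash ring: nodes are placed on the ring by hashing
# their names; a key lives on the first node at or past its own hash.
nodes = ["node-a", "node-b", "node-c"]
ring  = nodes.map { |n| [Digest::MD5.hexdigest(n).to_i(16), n] }.sort

def owner(ring, key)
  h = Digest::MD5.hexdigest(key).to_i(16)
  entry = ring.find { |pos, _| pos >= h } || ring.first # wrap around
  entry.last
end
```

Adding a machine claims only one arc of the ring, so just a fraction of keys move; that is what makes linear, masterless capacity growth possible.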
Where to next? The Adjacent Possible
Flip Kromer, CTO
@mrflip #bd4e
Something about Anything
Everything about Something
Bigger than One Computer
Bigger than Frontal Lobe
Bigger than Excel
what’s coming to help
myth of the “data base”
Hold your data
Ask questions
Managing & Shipping
Hadoop FTW
Cassandra, HBase, ElasticSearch, ...
Integration is still too hard
Dev Ops
Reliable Decoupling: Flume, Graphite
Data flutters by (label)
Elephants make sturdy piles (GROUP)
Number becomes thought (process_group)
Hadoop
class TwStP < Streamer
  def process(line)
    a = JSON.load(line) rescue {}
    yield a.values_at(*a.keys.sort)
  end
end
Wukong.run(TwStP)

Twitter Parser in a Tweet
pure functionality
Data Stores in Production
Cassandra
HBase
ElasticSearch
MySQL
Redis
TokyoTyrant
SimpleDB
MongoDB
sqlite
whisper (graphite)
file system
S3
Dev Ops: Rethink Hard
Still Blind
Visual Grammar to see it: NYTimes, Stamen, Ben Fry
Interactive tools: Tableau, Spotfire, bloom.io, d3.js, Gephi
Human-Scale Tools
Data-as-a-Service: Infochimps, SimpleGeo, Drawn to Scale
Business Intelligence: Familiar Paradigm, New Scale
BigSheets, Datameer
Panel Discussion
Stu Hood, Software Engineer
@stuhood #bd4e
Thanks for coming!
Stu Hood @stuhood
Flip Kromer @mrflip
Matt Pfeil @mattz62
Eric Sammer @esammer
Steve Watt @wattsteve