Hadoop at datasift

Presentation given at Edinburgh University Student Tech-Meetup on 6th Feb, 2013.

Transcript of Hadoop at datasift

Page 1: Hadoop at datasift

HADOOP AT DATASIFT

Page 2: Hadoop at datasift

ABOUT ME

Jairam Chandar

Big Data Engineer, Datasift

@jairamc

http://about.me/jairam

http://jairam.me

And I’m a Formula 1 Fan!

Page 3: Hadoop at datasift

OUTLINE

• What is Datasift?

• Where do we use Hadoop?

• The Numbers

• The Use-cases

• The Lessons

Page 4: Hadoop at datasift

!! SALES PITCH ALERT !!

Page 5: Hadoop at datasift

WHAT IS DATASIFT?

Page 14: Hadoop at datasift

THE NUMBERS

•Machines

• HBase

• 60 Machines as RegionServers

• 1 HMaster

• 3 Zookeeper nodes

Page 15: Hadoop at datasift

THE NUMBERS

• Machines

• Hadoop

• 135 Machines divided into 2 clusters

• DataNodes/TaskTrackers

• NameNodes with High-Availability Failover

• 1 JobTracker each

Page 16: Hadoop at datasift

THE NUMBERS

• Machines

• DL380 Gen8

• 2 * Intel Xeon E5646 @ 2.40GHz (24 core total)

• 48GB RAM

• 6 * 2 TB disks in JBOD (small partition on first disk for OS, rest is storage)

• 1 Gigabit network links

Page 17: Hadoop at datasift

THE NUMBERS

• Data

• Average load of 7500 interactions per second

• Peak loads of 15000 interactions per second sustained over a minute

• Peak of 21000 interactions per second during the Super Bowl

• Total current capacity ~ 1.6 PB; Total current usage ~ 800 TB

• Avg size of interaction 2 KB: that's ~ 1 GB a min or ~ 2 TB a day with replication (RF = 3)

• And that’s not it!
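The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 KB = 1000 bytes) and, for capacity, that all 135 Hadoop machines contribute their six 2 TB disks in full (ignoring the small OS partition); neither assumption is stated in the deck. Under those assumptions the average load works out to about 0.9 GB a minute and about 1.3 TB a day before replication, and the raw disk capacity to 1.62 PB. Multiplying by RF = 3 gives about 3.9 TB a day, which is higher than the ~ 2 TB quoted on the slide; the deck does not say how that figure was derived, and factors such as compression are not modelled here.

```python
# Back-of-the-envelope check of the slide figures.
# Assumption: decimal units (1 KB = 1000 bytes).
KB, GB, TB, PB = 10**3, 10**9, 10**12, 10**15

avg_rate = 7500             # interactions per second (average load)
interaction_size = 2 * KB   # average interaction size in bytes
replication_factor = 3      # replication factor (RF) from the slide

bytes_per_second = avg_rate * interaction_size
raw_gb_per_minute = bytes_per_second * 60 / GB
raw_tb_per_day = bytes_per_second * 86_400 / TB
replicated_tb_per_day = raw_tb_per_day * replication_factor

# Assumption: 135 machines, each with 6 x 2 TB disks used entirely for storage.
raw_capacity_pb = 135 * 6 * 2 * TB / PB

print(f"{raw_gb_per_minute:.2f} GB/min before replication")
print(f"{raw_tb_per_day:.2f} TB/day before replication")
print(f"{replicated_tb_per_day:.2f} TB/day at RF = {replication_factor}")
print(f"{raw_capacity_pb:.2f} PB raw disk capacity")
```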

Page 18: Hadoop at datasift

THE USE CASES

• HBase

• Recordings

• Archive

• Map/Reduce

• Exports

• Historics

• Migration

Page 19: Hadoop at datasift

THE USE CASES

• Recordings

• User defined streams

• Stored in HBase for later retrieval

• Export to multiple output formats and stores

• <recording-id><interaction-uuid>

• Recording-id is a SHA-1 hash

• Allows recordings to be distributed by their key without generating hot-spots.
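A minimal sketch of the <recording-id><interaction-uuid> row key layout described above. The slides do not say what the SHA-1 is computed over, so hashing a stream identifier string is an assumption, and the function names `recording_id` and `row_key` are hypothetical. This is plain Python for illustration, not the HBase client API.

```python
import hashlib
import uuid


def recording_id(stream_identifier: str) -> bytes:
    """Hypothetical: a 20-byte recording id, the SHA-1 of some stream identifier."""
    return hashlib.sha1(stream_identifier.encode("utf-8")).digest()


def row_key(rec_id: bytes, interaction_uuid: uuid.UUID) -> bytes:
    """Build <recording-id><interaction-uuid> as a fixed-width 36-byte key."""
    return rec_id + interaction_uuid.bytes


rec = recording_id("example-stream")   # placeholder identifier
key = row_key(rec, uuid.uuid4())
```

Because SHA-1 output is close to uniformly distributed, the key prefixes of different recordings spread across the key space, while all rows of one recording share a prefix and stay contiguous for scanning.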

Page 20: Hadoop at datasift

THE RECORDER

Page 21: Hadoop at datasift

THE USE CASES

• Exporter

• Export data from HBase for customer

• Export files ~ 5 – 10 GB or ~ 3-6 million records

• MapReduce over HBase using TableInputFormat

• But the data needs to be sorted

• TotalOrderPartitioner
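The idea behind total-order partitioning can be sketched in plain Python: sample the keys, pick split points from the sample, and route each key to the partition whose range contains it, so that concatenating the locally sorted partitions yields a globally sorted result. This is an illustration of the technique only, not Hadoop's TotalOrderPartitioner API; the function names and the sample size are hypothetical, and the sketch assumes a non-empty key list with at least as many sampled keys as partitions.

```python
import bisect
import random


def choose_split_points(sample, num_partitions):
    """Pick num_partitions - 1 boundary keys from a sample of the keys."""
    ordered = sorted(sample)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Index of the partition (reducer) whose key range contains key."""
    return bisect.bisect_right(split_points, key)


def total_order_sort(keys, num_partitions, sample_size=100, seed=0):
    """Partition keys by range, then sort each partition independently."""
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    splits = choose_split_points(sample, num_partitions)
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[partition_for(key, splits)].append(key)
    # Each "reducer" sorts only its own partition.
    return [sorted(part) for part in partitions]


keys = random.Random(42).sample(range(1_000_000), 1000)
parts = total_order_sort(keys, num_partitions=4)
merged = [k for part in parts for k in part]
```

Every key in partition i is smaller than every key in partition i + 1, so the output files can simply be concatenated in partition order. How evenly the work is spread depends on how well the sample represents the key distribution.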

Page 22: Hadoop at datasift

EXPORTER

Page 23: Hadoop at datasift

HISTORICS

Page 24: Hadoop at datasift

THE USE CASES

• Twitter Import

• 2 years of Tweets

• About 95,000,000,000 tweets

• Over 300 TB with added augmentation

• Import was not as simple as you would imagine

Page 25: Hadoop at datasift

THE USE CASES

• Archive

• Not just the Firehose but the Ultrahose

• Stored in HBase as well

• HBase architecture (BigTable) creates Hotspots with Time Series data

• Leading randomizing bit (see HBaseWD)

• Pre-split regions

• Concurrent writes
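A rough sketch of the salting idea above, assuming a one-byte bucket prefix derived from a checksum of the original key and 16 buckets; both choices are assumptions for illustration, and none of this is HBaseWD's actual API. Sequential timestamp keys land in different buckets, one pre-split region can serve each bucket, and a time-ordered read has to scan every bucket and merge the results.

```python
import heapq
import zlib

NUM_BUCKETS = 16  # assumed bucket count; the slides do not give one


def salted_key(original_key: bytes, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix the key with a one-byte bucket derived from its checksum."""
    bucket = zlib.crc32(original_key) % num_buckets
    return bytes([bucket]) + original_key


def region_split_points(num_buckets: int = NUM_BUCKETS):
    """Pre-split boundaries so that each bucket prefix gets its own region."""
    return [bytes([b]) for b in range(1, num_buckets)]


def scan_in_key_order(rows_by_bucket):
    """Merge per-bucket sorted scans back into original key order."""
    return list(heapq.merge(*rows_by_bucket, key=lambda k: k[1:]))


# Time-series keys: consecutive timestamps as 8-byte big-endian values.
timestamps = [ts.to_bytes(8, "big") for ts in range(1_360_000_000, 1_360_000_100)]
salted = [salted_key(ts) for ts in timestamps]

buckets = {}
for k in sorted(salted):
    buckets.setdefault(k[0], []).append(k)

restored = [k[1:] for k in scan_in_key_order(buckets.values())]
```

The trade-off is that writes are spread across regions at the cost of more expensive range reads, since one logical scan becomes one scan per bucket.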

Page 26: Hadoop at datasift

THE USE CASES

• Historics

• Export archive data

• Slightly different from Exporter

• Much larger time lines (1 – 3 months)

• Controlled access to Hadoop cluster with efficient job scheduling

• Unfiltered Input Data

• Therefore longer processing time

• Hence more optimizations required

Page 27: Hadoop at datasift

HISTORICS

Page 28: Hadoop at datasift

THE LESSONS

• Tune Tune Tune (Default == BAD)

• Tune based on use case:

• Heap

• Block Size

• Memstore size

• Keep number of column families low

• Be aware of hot-spotting issue when writing time-series data

Page 29: Hadoop at datasift

THE LESSONS

• Use compression (e.g. Snappy)

• Ops need intimate understanding of system

• Monitor system metrics (GC, CPU, Compaction, I/O) and application metrics (writes/sec etc.)

• Don't be afraid to fiddle with HBase code

• Using a distribution is advisable

Page 30: Hadoop at datasift

QUESTIONS?

We are hiring

http://datasift.com/about-us/careers