Spotting Hadoop in the wild
Practical use cases from Last.fm and Massive Media
@klbostee
Thursday 12 January 12
• “Data scientist is a job title for an employee who analyses data, particularly large amounts of it, to help a business gain a competitive edge” —WhatIs.com
• “Someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning” —Hilary Mason, bit.ly
• 2007: Started using Hadoop as PhD student
• 2009: Data & Scalability Engineer at Last.fm
• 2011: Data Scientist at Massive Media
• Created Dumbo, a Python API for Hadoop
• Contributed some code to Hadoop itself
• Organized several HUGUK meetups
What are those yellow things?
Core principles
• Distributed
• Fault tolerant
• Sequential reads and writes
• Data locality
Pars pro toto
[Ecosystem diagram: Pig, Hive, MapReduce, HBase and ZooKeeper layered on top of HDFS]
Hadoop itself is basically the kernel: it provides a file system and a task scheduler
Hadoop file system
[Diagram: files A and B are split into Hadoop blocks, each much larger than a Linux block, and the blocks are replicated across the DataNodes]
No random writes!
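The diagram itself didn't survive the transcript, but the block mechanics can be sketched in a few lines of Python. The 64 MB block size was the classic HDFS default; the placement policy below is a toy round-robin, not Hadoop's actual rack-aware algorithm:

```python
# Toy sketch of HDFS-style block placement (illustrative only).
BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default block size
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks a file of `file_size` bytes occupies."""
    return max(1, -(-file_size // block_size))  # ceiling division

def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file needs 4 blocks
layout = place_blocks(blocks, ["dn1", "dn2", "dn3"])
```

Because blocks are written once and replicated whole, there is nothing to update in place, which is exactly why HDFS can forbid random writes.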
Hadoop task scheduler
[Diagram: a TaskTracker runs alongside the DataNode on every node; the tasks of jobs A and B are scheduled on the nodes that store their input data]
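The data-locality idea can be sketched as follows: the scheduler prefers to run each task on a node that already stores the task's input block. The node names and the load heuristic are illustrative, not Hadoop's actual scheduler:

```python
# Toy sketch of data-local task scheduling (illustrative only).
def schedule(tasks, block_locations):
    """Prefer running each task on a node that already stores its input block.

    tasks: {task_id: block_id}; block_locations: {block_id: [nodes]}.
    Returns {task_id: node}.
    """
    assignment = {}
    load = {}  # crude per-node load counter
    for task, block in tasks.items():
        candidates = block_locations.get(block, [])
        if not candidates:
            candidates = list(load) or ["dn1"]  # no replica info: any node
        # pick the least-loaded replica holder -> data locality
        node = min(candidates, key=lambda n: load.get(n, 0))
        assignment[task] = node
        load[node] = load.get(node, 0) + 1
    return assignment
```

Shipping the computation to the data instead of the data to the computation is what makes sequential local reads cheap.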
Some practical tips
• Install a distribution
• Use compression
• Consider increasing your block size
• Watch out for small files
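The "small files" warning comes from the NameNode keeping every file and block object in RAM; a commonly cited ballpark is roughly 150 bytes of heap per object (treat the figure as an estimate, not a spec):

```python
# Rough rule-of-thumb arithmetic for the "small files" problem.
# The NameNode holds every file and block object in memory; a common
# ballpark is ~150 bytes of heap per object (an estimate, not a spec).
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    objects = num_files * (1 + blocks_per_file)  # file entry + its blocks
    return objects * BYTES_PER_OBJECT

# 100 million one-block files -> ~30 GB of NameNode heap
heap = namenode_heap_bytes(100_000_000)
```

The same data packed into fewer, larger files costs the NameNode orders of magnitude less memory, which is also why increasing the block size helps.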
HBase
[Ecosystem diagram: HBase highlighted among Pig, Hive, MapReduce and ZooKeeper on top of HDFS]
HBase is a database on top of HDFS that can easily be accessed from MapReduce
Data model
[Table: sorted row keys; column family A holds columns X and Y, column family B holds columns U and V]
• Configurable number of versions per cell
• Each cell version has a timestamp
• TTL can be specified per column family
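The data model can be sketched as a sorted map of maps with versioned cells. This is a toy sketch of the concept, not the HBase API:

```python
# Toy sketch of the HBase data model: sorted row keys mapping to
# {"family:qualifier": {timestamp: value}} cells (illustrative only).
import time
from bisect import insort

class ToyTable:
    def __init__(self, max_versions=3):
        self.rows = {}            # row -> {column: {timestamp: value}}
        self.sorted_keys = []     # row keys kept in sorted order
        self.max_versions = max_versions

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else int(time.time() * 1000)
        if row not in self.rows:
            self.rows[row] = {}
            insort(self.sorted_keys, row)
        cell = self.rows[row].setdefault(column, {})
        cell[ts] = value
        # keep only the newest `max_versions` versions of the cell
        for old in sorted(cell)[:-self.max_versions]:
            del cell[old]

    def get(self, row, column):
        cell = self.rows.get(row, {}).get(column, {})
        return cell[max(cell)] if cell else None  # newest version wins

    def scan(self, start, stop):
        return [k for k in self.sorted_keys if start <= k < stop]
```

Keeping row keys sorted is what makes range scans cheap, and empty cells simply don't exist in the map, so they cost nothing.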
Random becomes sequential
[Diagram: each write is appended to a commit log and inserted into a sorted in-memory memstore; full memstores are flushed to HDFS as sorted KeyValue files, so both writes are sequential]
High write throughput!
+ efficient scans
+ free empty cells
+ no fragmentation
+ ...
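The write path above can be sketched as a toy log-structured store: every put does two sequential writes (log append, then eventually a sorted flush), and no random I/O ever happens. Names and the flush policy are illustrative only:

```python
# Toy sketch of the HBase write path (illustrative only): puts go to an
# append-only commit log plus an in-memory memstore; full memstores are
# flushed as immutable sorted files, i.e. another sequential write.
class ToyStore:
    def __init__(self, flush_threshold=3):
        self.commit_log = []      # append-only write-ahead log
        self.memstore = {}        # in-memory, sorted on flush
        self.hfiles = []          # flushed sorted files "on HDFS"
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.commit_log.append((key, value))      # sequential write #1
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # dump the memstore as one sorted run: sequential write #2
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.commit_log.clear()   # log no longer needed once data is safe

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):       # newest file first
            for k, v in hfile:
                if k == key:
                    return v
        return None
```

Reads check the memstore first and then the flushed files from newest to oldest, so the latest version of a key always wins.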
Horizontal scaling
[Diagram: the sorted row-key space is split into regions, and each RegionServer serves several regions]
• Each region has its own commit log and memstores
• Moving regions is easy since the data is all in HDFS
• Strong consistency as each region is served only once
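Region lookup can be sketched as a binary search over the sorted region start keys: a row belongs to the region whose start key is the greatest one not exceeding the row. The start keys and server names below are made up:

```python
# Toy sketch of region lookup over a sorted row-key space
# (start keys and server names are illustrative only).
from bisect import bisect_right

# (start_key, region_server) pairs, sorted by start key; "" = first region
REGIONS = [("", "rs1"), ("g", "rs2"), ("p", "rs3")]

def region_for(row_key):
    starts = [start for start, _ in REGIONS]
    idx = bisect_right(starts, row_key) - 1   # greatest start <= row_key
    return REGIONS[idx][1]
```

Because each row key maps to exactly one region and each region is served by exactly one server at a time, reads and writes stay strongly consistent.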
Some practical tips
• Restrict the number of regions per server
• Restrict the number of column families
• Use compression
• Increase file descriptor limits on nodes
• Use a large enough buffer when scanning
Look, a herd of Hadoops!
• “Last.fm lets you effortlessly keep a record of what you listen to from any player. Based on your taste, Last.fm recommends you more music and concerts” —Last.fm
• Over 60 billion tracks scrobbled since 2003
• Started using Hadoop in 2006, before Yahoo
• “Massive Media is the social media company behind the successful digital brands Netlog.com and Twoo.com. We enable members to meet nearby people instantly” —MassiveMedia.eu
• Over 80 million users on web and mobile
• Using Hadoop for about a year now
Hadoop adoption
1. Business intelligence
2. Testing and experimentation
3. Fraud and abuse detection
4. Product features
5. PR and marketing
Last.fm: √√√√√ (all five)
Massive Media: √√√√ (four of the five)
Business intelligence

Testing and experimentation

Fraud and abuse detection

Product features

PR and marketing
Let’s dive into the first use case!
Goals and requirements
• Timeseries graphs of 1000 or so metrics
• Segmented over about 10 dimensions
1. Scale to a very large number of events
2. History for graphs must be long enough
3. Accessing the graphs must be instantaneous
4. Possibility to analyse in detail when needed
Attempt #1
• Log table in MySQL
• Generate graphs from this table on-the-fly
1. Large number of events ✓
2. Long enough history ✗
3. Instantaneous access ✗
4. Analyse in detail ✓
Attempt #2
• Counters in MySQL table
• Update counters on every event
1. Large number of events ✗
2. Long enough history ✓
3. Instantaneous access ✓
4. Analyse in detail ✗
Attempt #3
• Put log files in HDFS through syslog-ng
• MapReduce on logs and write to HBase
1. Large number of events ✓
2. Long enough history ✓
3. Instantaneous access ✓
4. Analyse in detail ✓
Architecture
[Diagram: syslog-ng feeds the logs into HDFS; MapReduce jobs process them and write to HBase; a realtime processing path also writes to HBase, and ad-hoc results are produced from MapReduce]
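The MapReduce stage of such a pipeline could be written in the spirit of Dumbo, the speaker's Python API for Hadoop. The log format and field layout here are hypothetical, not the actual Last.fm or Massive Media code:

```python
# Dumbo-style MapReduce sketch for the log-aggregation step.
# Hypothetical log format: "timestamp<TAB>metric<TAB>...".
def mapper(key, value):
    # value is one raw log line; emit ((metric, hour_bucket), 1)
    fields = value.split("\t")
    timestamp, metric = int(fields[0]), fields[1]
    hour = timestamp - timestamp % 3600   # truncate to the hour
    yield (metric, hour), 1

def reducer(key, values):
    # sum the per-event counts for each (metric, hour) bucket
    yield key, sum(values)

# With Dumbo installed, the job would be wired up roughly like:
#   import dumbo
#   dumbo.run(mapper, reducer)
```

The reducer's output per (metric, hour) key is what would then be written into the per-granularity HBase tables described on the next slide.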
HBase schema
• Separate table for each time granularity
• Global segmentations in row keys
  • <language>||<country>||...|||<timestamp>
  • * for “not specified”
  • trailing *s are omitted
• Further segmentations in column keys
  • e.g. payments_via_paypal, payments_via_sms
• Related metrics in same column family
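One plausible reading of this row-key scheme, sketched in Python. The helper name and argument set are illustrative, and only language and country are shown out of the roughly 10 dimensions:

```python
# Sketch of the slide's row-key scheme: segmentation values joined
# with "||", "|||" before the timestamp, "*" for "not specified",
# and trailing *s omitted (helper name and arguments are made up).
def make_row_key(timestamp, language="*", country="*"):
    parts = [language, country]
    while parts and parts[-1] == "*":   # trailing *s are omitted
        parts.pop()
    prefix = "||".join(parts)
    return prefix + "|||" + str(timestamp)
```

Putting the segmentation values before the timestamp keeps all rows for one segment contiguous in HBase's sorted key space, so drawing one graph is a single range scan.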
Questions?