Spotting Hadoop in the wild
Practical use cases from Last.fm and Massive Media
@klbostee
Thursday 12 January 12
• “Data scientist is a job title for an employee who analyses data, particularly large amounts of it, to help a business gain a competitive edge” —WhatIs.com
• “Someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning” —Hilary Mason, bit.ly
• 2007: Started using Hadoop as PhD student
• 2009: Data & Scalability Engineer at Last.fm
• 2011: Data Scientist at Massive Media
• Created Dumbo, a Python API for Hadoop
• Contributed some code to Hadoop itself
• Organized several HUGUK meetups
What are those yellow things?
Core principles
• Distributed
• Fault tolerant
• Sequential reads and writes
• Data locality
Pars pro toto
[Ecosystem diagram: Pig, Hive, MapReduce, HBase and ZooKeeper layered on top of HDFS]
Hadoop itself is basically the kernel: it provides a file system and a task scheduler
Hadoop file system
[Diagram: files A and B are split into Hadoop blocks, each much larger than a Linux block, and the blocks are replicated across the DataNodes]
No random writes!
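The diagram itself didn't survive the transcript, but the block mechanics can be sketched in a few lines of Python. The 64 MB block size was the classic HDFS default; the placement policy below is a toy round-robin, not Hadoop's actual rack-aware algorithm:

```python
# Toy sketch of HDFS-style block placement (illustrative only).
BLOCK_SIZE = 64 * 1024 * 1024  # classic HDFS default block size
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the number of blocks a file of `file_size` bytes occupies."""
    return max(1, -(-file_size // block_size))  # ceiling division

def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file needs 4 blocks
layout = place_blocks(blocks, ["dn1", "dn2", "dn3"])
```

Because blocks are written once and replicated whole, there is nothing to update in place, which is exactly why HDFS can forbid random writes.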
Hadoop task scheduler
[Diagram: a TaskTracker runs alongside the DataNode on every node; the tasks of jobs A and B are scheduled on the nodes that store their input data]
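The data-locality idea can be sketched as follows: the scheduler prefers to run each task on a node that already stores the task's input block. The node names and the load heuristic are illustrative, not Hadoop's actual scheduler:

```python
# Toy sketch of data-local task scheduling (illustrative only).
def schedule(tasks, block_locations):
    """Prefer running each task on a node that already stores its input block.

    tasks: {task_id: block_id}; block_locations: {block_id: [nodes]}.
    Returns {task_id: node}.
    """
    assignment = {}
    load = {}  # crude per-node load counter
    for task, block in tasks.items():
        candidates = block_locations.get(block, [])
        if not candidates:
            candidates = list(load) or ["dn1"]  # no replica info: any node
        # pick the least-loaded replica holder -> data locality
        node = min(candidates, key=lambda n: load.get(n, 0))
        assignment[task] = node
        load[node] = load.get(node, 0) + 1
    return assignment
```

Shipping the computation to the data instead of the data to the computation is what makes sequential local reads cheap.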
Some practical tips
• Install a distribution
• Use compression
• Consider increasing your block size
• Watch out for small files
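The "small files" warning comes from the NameNode keeping every file and block object in RAM; a commonly cited ballpark is roughly 150 bytes of heap per object (treat the figure as an estimate, not a spec):

```python
# Rough rule-of-thumb arithmetic for the "small files" problem.
# The NameNode holds every file and block object in memory; a common
# ballpark is ~150 bytes of heap per object (an estimate, not a spec).
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    objects = num_files * (1 + blocks_per_file)  # file entry + its blocks
    return objects * BYTES_PER_OBJECT

# 100 million one-block files -> ~30 GB of NameNode heap
heap = namenode_heap_bytes(100_000_000)
```

The same data packed into fewer, larger files costs the NameNode orders of magnitude less memory, which is also why increasing the block size helps.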
HBase
[Ecosystem diagram: HBase highlighted among Pig, Hive, MapReduce and ZooKeeper on top of HDFS]
HBase is a database on top of HDFS that can easily be accessed from MapReduce
Data model
[Table: sorted row keys; column family A holds columns X and Y, column family B holds columns U and V]
• Configurable number of versions per cell
• Each cell version has a timestamp
• TTL can be specified per column family
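The data model can be sketched as a sorted map of maps with versioned cells. This is a toy sketch of the concept, not the HBase API:

```python
# Toy sketch of the HBase data model: sorted row keys mapping to
# {"family:qualifier": {timestamp: value}} cells (illustrative only).
import time
from bisect import insort

class ToyTable:
    def __init__(self, max_versions=3):
        self.rows = {}            # row -> {column: {timestamp: value}}
        self.sorted_keys = []     # row keys kept in sorted order
        self.max_versions = max_versions

    def put(self, row, column, value, ts=None):
        ts = ts if ts is not None else int(time.time() * 1000)
        if row not in self.rows:
            self.rows[row] = {}
            insort(self.sorted_keys, row)
        cell = self.rows[row].setdefault(column, {})
        cell[ts] = value
        # keep only the newest `max_versions` versions of the cell
        for old in sorted(cell)[:-self.max_versions]:
            del cell[old]

    def get(self, row, column):
        cell = self.rows.get(row, {}).get(column, {})
        return cell[max(cell)] if cell else None  # newest version wins

    def scan(self, start, stop):
        return [k for k in self.sorted_keys if start <= k < stop]
```

Keeping row keys sorted is what makes range scans cheap, and empty cells simply don't exist in the map, so they cost nothing.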
Random becomes sequential
[Diagram: each write is appended to a commit log and inserted into a sorted in-memory memstore; full memstores are flushed to HDFS as sorted KeyValue files, so both writes are sequential]
High write throughput!
+ efficient scans
+ free empty cells
+ no fragmentation
+ ...
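The write path above can be sketched as a toy log-structured store: every put does two sequential writes (log append, then eventually a sorted flush), and no random I/O ever happens. Names and the flush policy are illustrative only:

```python
# Toy sketch of the HBase write path (illustrative only): puts go to an
# append-only commit log plus an in-memory memstore; full memstores are
# flushed as immutable sorted files, i.e. another sequential write.
class ToyStore:
    def __init__(self, flush_threshold=3):
        self.commit_log = []      # append-only write-ahead log
        self.memstore = {}        # in-memory, sorted on flush
        self.hfiles = []          # flushed sorted files "on HDFS"
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.commit_log.append((key, value))      # sequential write #1
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # dump the memstore as one sorted run: sequential write #2
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.commit_log.clear()   # log no longer needed once data is safe

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):       # newest file first
            for k, v in hfile:
                if k == key:
                    return v
        return None
```

Reads check the memstore first and then the flushed files from newest to oldest, so the latest version of a key always wins.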
Horizontal scaling
[Diagram: the sorted row-key space is split into regions, and each RegionServer serves several regions]
• Each region has its own commit log and memstores
• Moving regions is easy since the data is all in HDFS
• Strong consistency as each region is served only once
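Region lookup can be sketched as a binary search over the sorted region start keys: a row belongs to the region whose start key is the greatest one not exceeding the row. The start keys and server names below are made up:

```python
# Toy sketch of region lookup over a sorted row-key space
# (start keys and server names are illustrative only).
from bisect import bisect_right

# (start_key, region_server) pairs, sorted by start key; "" = first region
REGIONS = [("", "rs1"), ("g", "rs2"), ("p", "rs3")]

def region_for(row_key):
    starts = [start for start, _ in REGIONS]
    idx = bisect_right(starts, row_key) - 1   # greatest start <= row_key
    return REGIONS[idx][1]
```

Because each row key maps to exactly one region and each region is served by exactly one server at a time, reads and writes stay strongly consistent.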
Some practical tips
• Restrict the number of regions per server
• Restrict the number of column families
• Use compression
• Increase file descriptor limits on nodes
• Use a large enough buffer when scanning
Look, a herd of Hadoops!
• “Last.fm lets you effortlessly keep a record of what you listen to from any player. Based on your taste, Last.fm recommends you more music and concerts” —Last.fm
• Over 60 billion tracks scrobbled since 2003
• Started using Hadoop in 2006, before Yahoo
• “Massive Media is the social media company behind the successful digital brands Netlog.com and Twoo.com. We enable members to meet nearby people instantly” —MassiveMedia.eu
• Over 80 million users on web and mobile
• Using Hadoop for about a year now
Hadoop adoption
1. Business intelligence
2. Testing and experimentation
3. Fraud and abuse detection
4. Product features
5. PR and marketing
Last.fm: √√√√√ (all five)
Massive Media: √√√√ (four of the five)
Business intelligence

Testing and experimentation

Fraud and abuse detection

Product features

PR and marketing
Let’s dive into the first use case!
Goals and requirements
• Timeseries graphs of 1000 or so metrics
• Segmented over about 10 dimensions
1. Scale to a very large number of events
2. History for graphs must be long enough
3. Accessing the graphs must be instantaneous
4. Possibility to analyse in detail when needed
Attempt #1
• Log table in MySQL
• Generate graphs from this table on-the-fly
1. Large number of events ✓
2. Long enough history ✗
3. Instantaneous access ✗
4. Analyse in detail ✓
Attempt #2
• Counters in MySQL table
• Update counters on every event
1. Large number of events ✗
2. Long enough history ✓
3. Instantaneous access ✓
4. Analyse in detail ✗
Attempt #3
• Put log files in HDFS through syslog-ng
• MapReduce on logs and write to HBase
1. Large number of events ✓
2. Long enough history ✓
3. Instantaneous access ✓
4. Analyse in detail ✓
Architecture
[Diagram: syslog-ng feeds the logs into HDFS; MapReduce jobs process them and write to HBase; a realtime processing path also writes to HBase, and ad-hoc results are produced from MapReduce]
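The MapReduce stage of such a pipeline could be written in the spirit of Dumbo, the speaker's Python API for Hadoop. The log format and field layout here are hypothetical, not the actual Last.fm or Massive Media code:

```python
# Dumbo-style MapReduce sketch for the log-aggregation step.
# Hypothetical log format: "timestamp<TAB>metric<TAB>...".
def mapper(key, value):
    # value is one raw log line; emit ((metric, hour_bucket), 1)
    fields = value.split("\t")
    timestamp, metric = int(fields[0]), fields[1]
    hour = timestamp - timestamp % 3600   # truncate to the hour
    yield (metric, hour), 1

def reducer(key, values):
    # sum the per-event counts for each (metric, hour) bucket
    yield key, sum(values)

# With Dumbo installed, the job would be wired up roughly like:
#   import dumbo
#   dumbo.run(mapper, reducer)
```

The reducer's output per (metric, hour) key is what would then be written into the per-granularity HBase tables described on the next slide.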
HBase schema
• Separate table for each time granularity
• Global segmentations in row keys
  • <language>||<country>||...|||<timestamp>
  • * for “not specified”
  • trailing *s are omitted
• Further segmentations in column keys
  • e.g. payments_via_paypal, payments_via_sms
• Related metrics in same column family
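One plausible reading of this row-key scheme, sketched in Python. The helper name and argument set are illustrative, and only language and country are shown out of the roughly 10 dimensions:

```python
# Sketch of the slide's row-key scheme: segmentation values joined
# with "||", "|||" before the timestamp, "*" for "not specified",
# and trailing *s omitted (helper name and arguments are made up).
def make_row_key(timestamp, language="*", country="*"):
    parts = [language, country]
    while parts and parts[-1] == "*":   # trailing *s are omitted
        parts.pop()
    prefix = "||".join(parts)
    return prefix + "|||" + str(timestamp)
```

Putting the segmentation values before the timestamp keeps all rows for one segment contiguous in HBase's sorted key space, so drawing one graph is a single range scan.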
Questions?