StumbleUpon UK Hadoop Users Group 2011

Transcript of StumbleUpon UK Hadoop Users Group 2011

Page 1: A Sneak Peek into StumbleUpon’s Infrastructure

Page 2: Quick SU Intro

Page 3: Our Traffic

Page 4: Our Stack: 100% Open-Source

• MySQL (legacy source of truth)
• Memcache (lots)
• HBase (most new apps / features)
• Hadoop (DWH, MapReduce, Hive, ...)
• elasticsearch (“you know, for search”)
• OpenTSDB (distributed monitoring; see the sketch below)
• Varnish (HTTP load-balancing)
• Gearman (processing off the fast path)
• ... etc

In prod since ’09
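
OpenTSDB shows up again later in the deck; as a minimal sketch of how a data point reaches it, assuming the stock TSD’s plain-text telnet interface on its default port 4242 (the metric name, tags, and TSD hostname below are made up for illustration):

```python
import socket
import time

def send_data_point(metric, value, tags, host="tsd.example.com", port=4242):
    """Push one data point to a TSD using OpenTSDB's plain-text 'put' command."""
    tag_str = " ".join(f"{k}={v}" for k, v in tags.items())
    line = f"put {metric} {int(time.time())} {value} {tag_str}\n"
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode("ascii"))

# Hypothetical metric and tags, just to show the wire format.
send_data_point("http.requests", 1234, {"host": "web42", "colo": "sjc"})
```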

Page 5: The Infrastructure

[Network diagram: two Arista 7050 core switches (1U, 52 x 10GbE SFP+) connected via L3 ECMP to Arista 7048T top-of-rack switches (1U, 48 x 1GbE copper plus 4 x 10GbE SFP+ uplinks), feeding 2U chassis of thin nodes and thick nodes; jumbo frames, MTU=9000.]

Page 6: The Infrastructure

• SuperMicro half-width motherboards
• 2 x Intel L5630 (40W TDP) (16 hardware threads total)
• 48GB RAM
• Commodity disks (consumer-grade SATA, 7200rpm)
  • 1 x 2TB per “thin node” (4-in-2U) (web/app servers, Gearman, etc.)
  • 6 x 2TB per “thick node” (2-in-2U) (Hadoop/HBase, elasticsearch, etc.)

(86 nodes = 1PB)
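
A quick back-of-envelope check of that figure, assuming it refers to raw disk across thick nodes with 6 x 2TB drives each (the deck doesn’t spell the arithmetic out):

```python
# Rough check of "(86 nodes = 1PB)", assuming thick nodes with 6 x 2TB drives each.
nodes = 86
tb_per_node = 6 * 2              # six 2TB drives per thick node
raw_tb = nodes * tb_per_node     # 1032 TB
print(raw_tb, "TB, i.e. roughly", raw_tb / 1000, "PB raw")
```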

Page 7: The Infrastructure

• No virtualization
• No oversubscription
• Rack locality doesn’t matter much (sub-100µs RTT across racks)
• cgroups / Linux containers to keep MapReduce under control
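
The cgroups bullet names the mechanism but not the recipe; one illustrative sketch, assuming a cgroup v1 layout with made-up paths and limits (not necessarily how StumbleUpon configured it), is to pin MapReduce child JVMs into a memory-capped group so runaway tasks can’t starve HBase:

```python
import os

CGROUP = "/sys/fs/cgroup/memory/mapreduce"   # hypothetical cgroup for MR tasks

def write(path, value):
    with open(path, "w") as f:
        f.write(str(value))

# Cap all MapReduce tasks at 24GB of the 48GB box (illustrative limit).
os.makedirs(CGROUP, exist_ok=True)
write(os.path.join(CGROUP, "memory.limit_in_bytes"), 24 * 1024**3)

def confine(pid):
    """Move a process (e.g. a MapReduce child JVM) into the cgroup."""
    write(os.path.join(CGROUP, "tasks"), pid)
```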

Two production HBase clusters per colo:
• Low-latency (user-facing services)
• Batch (analytics, scheduled jobs, ...)

Page 8: Low-Latency Cluster

• Workload mostly driven by HBase
• Very few scheduled MR jobs
• HBase replication to batch cluster
• Most queries from PHP over Thrift
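
The deck says those queries go from PHP over Thrift; purely to illustrate the access pattern, here is the equivalent point read/write sketched in Python with the Thrift-based happybase client rather than PHP (connection details, table, column family, and row key are hypothetical):

```python
import happybase  # Thrift-based HBase client (pip install happybase)

# Hypothetical connection details and schema, for illustration only.
conn = happybase.Connection("hbase-thrift.example.com", port=9090)
table = conn.table("user_ratings")

# Point lookup of the kind a user-facing request would issue.
row = table.row(b"user:12345", columns=[b"r:stumbles", b"r:likes"])

# Single-row write.
table.put(b"user:12345", {b"r:likes": b"42"})
conn.close()
```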

Challenges:
• Tuning Hadoop for low latency
• Taming the long latency tail
• Quickly recovering from failures

Page 9: Batch Cluster

• 2x more capacity
• Wildly changing workload (e.g. 40K → 14M QPS)
• Lots of scheduled MR jobs
• Frequent ad-hoc jobs (MR/Hive)
• OpenTSDB’s data: >800M data points added per day, 133B data points total (rough check below)

Challenges:
• Resource isolation
• Tuning for larger scale
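
A rough check on those OpenTSDB numbers, treating them as steady averages (real ingest is bursty and the rate has grown over time):

```python
# Back-of-envelope only: average ingest rate and how much history the total implies.
points_per_day = 800e6
total_points = 133e9

avg_per_second = points_per_day / 86_400         # ≈ 9.3K data points/sec on average
days_of_history = total_points / points_per_day  # ≈ 166 days at the current rate

print(f"{avg_per_second:,.0f} points/sec, ~{days_of_history:.0f} days of history")
```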