Real-time Analytics at Facebook
Zheng Shao
10/18/2011
Agenda
1 Analytics and Real-time
2 Data Freeway
3 Puma
4 Future Works
Analytics and Real-time: what and why
Facebook Insights
• Use cases
▪ Websites/Ads/Apps/Pages
▪ Time series
▪ Demographic break-downs
▪ Unique counts/heavy hitters
• Major challenges
▪ Scalability
▪ Latency
Analytics based on Hadoop/Hive
• 3000-node Hadoop cluster
• Copier/Loader: Map-Reduce hides machine failures
• Pipeline Jobs: Hive allows SQL-like syntax
• Good scalability, but poor latency: 24 to 48 hours.
[Diagram: data flows from HTTP through Scribe and NFS into Hive/Hadoop and then MySQL. Latency labels: seconds, seconds, hourly Copier/Loader, daily Pipeline Jobs]
How to Get Lower Latency?
• Small-batch Processing
▪ Run Map-Reduce/Hive every hour, every 15 min, every 5 min, …
▪ How do we reduce per-batch overhead?
• Stream Processing
▪ Aggregate the data as soon as it arrives
▪ How to solve the reliability problem?
Decisions
• Stream Processing wins!
• Data Freeway
▪ Scalable Data Stream Framework
• Puma
▪ Reliable Stream Aggregation Engine
Data Freeway: scalable data stream
Scribe
• Simple push/RPC-based logging system
• Open-sourced in 2008. 100 log categories at that time.
• Routing driven by static configuration.
[Diagram labels: Scribe Clients, Scribe Mid-Tier, Scribe Writers, NFS, HDFS, Batch Copier, Log Consumer, tail/fopen]
Data Freeway
• 9 GB/sec at peak, 10 sec latency, 2500 log categories
[Diagram labels: Scribe Clients, Calligraphus Mid-tier, Calligraphus Writers, HDFS (C1, C2, DataNodes), Zookeeper, PTail, Log Consumer, Continuous Copier, HDFS, PTail (in the plan)]
Calligraphus
• RPC → File System
▪ Each log category is represented by 1 or more FS directories
▪ Each directory is an ordered list of files
• Bucketing support
▪ Application buckets are application-defined shards.
▪ Infrastructure buckets allow log streams from x B/s to x GB/s
• Performance
▪ Latency: Call sync every 7 seconds
▪ Throughput: Easily saturate 1Gbit NIC
Continuous Copier
• File System → File System
• Low latency and smooth network usage
• Deployment
▪ Implemented as long-running map-only job
▪ Can move to any simple job scheduler
• Coordination
▪ Use lock files on HDFS for now
▪ Plan to move to Zookeeper
PTail
• File System → Stream (→ RPC)
• Reliability
▪ Checkpoints inserted into the data stream
▪ Can roll back and tail from any data checkpoint
▪ No data loss/duplicates
[Diagram labels: directories, files, checkpoint]
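The checkpoint idea above can be sketched as follows. This is a minimal in-memory illustration, not PTail's actual implementation: the `Checkpoint` tuple, the `tail` generator, and the list-of-lists stand-in for a directory of ordered files are all assumptions made for the example.

```python
from typing import Iterator, List, NamedTuple, Tuple


class Checkpoint(NamedTuple):
    """Hypothetical position marker: which file, and which line within it."""
    file_index: int
    line_index: int


def tail(directory: List[List[str]],
         start: Checkpoint = Checkpoint(0, 0)) -> Iterator[Tuple[str, Checkpoint]]:
    """Yield (line, checkpoint) pairs from an ordered list of files.

    The checkpoint attached to each line points at the NEXT unread line,
    so a consumer that saves it can resume without loss or duplicates.
    """
    for f in range(start.file_index, len(directory)):
        first = start.line_index if f == start.file_index else 0
        lines = directory[f]
        for i in range(first, len(lines)):
            yield lines[i], Checkpoint(f, i + 1)


directory = [["a", "b"], ["c"], ["d", "e"]]

# Consume three lines, remember the last checkpoint, then "crash".
stream = tail(directory)
seen = []
saved = Checkpoint(0, 0)
for _ in range(3):
    line, saved = next(stream)
    seen.append(line)

# After a restart, resume from the saved checkpoint.
resumed = [line for line, _ in tail(directory, saved)]
```

Because the checkpoint travels with the data, a consumer that persists its own state together with the checkpoint can restart from a consistent point.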
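Channel Comparison

| | Push / RPC | Pull / FS |
|---|---|---|
| Latency | 1-2 sec | 10 sec |
| Loss/Dups | Few | None |
| Robustness | Low | High |
| Complexity | Low | High |

| Input \ Output | Push / RPC | Pull / FS |
|---|---|---|
| Push / RPC | Scribe | Calligraphus |
| Pull / FS | PTail + ScribeSend | Continuous Copier |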
Puma: real-time aggregation/storage
Overview
• ~1M log lines per second, but light read load
• Multiple Group-By operations per log line
• The first key in Group By is always time/date-related
• Complex aggregations: unique user count, most frequent elements
[Diagram: Log Stream → Aggregations → Storage → Serving]
MySQL and HBase: one page

| | MySQL | HBase |
|---|---|---|
| Parallel | Manual sharding | Automatic load balancing |
| Fail-over | Manual master/slave switch | Automatic |
| Read efficiency | High | Low |
| Write efficiency | Medium | High |
| Columnar support | No | Yes |
Puma2 Architecture
• PTail provides parallel data streams
• For each log line, Puma2 issues “increment” operations to HBase. Puma2 is symmetric (no sharding).
• HBase: single increment on multiple columns
[Diagram: PTail → Puma2 → HBase → Serving]
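A minimal sketch of the per-line increment pattern described on this slide. `FakeCounterStore` is an in-memory stand-in invented for the example and is not the HBase client API; the row-key layout and column names are likewise assumptions.

```python
from collections import defaultdict
from typing import Dict


class FakeCounterStore:
    """In-memory stand-in for HBase; NOT the real HBase client API."""

    def __init__(self) -> None:
        self.rows: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
        self.calls = 0

    def increment(self, row: str, deltas: Dict[str, int]) -> None:
        """Apply one increment call that touches several columns of a row."""
        self.calls += 1
        for column, delta in deltas.items():
            self.rows[row][column] += delta


def process_line(store: FakeCounterStore, line: Dict[str, str]) -> None:
    """Puma2-style handling: every log line becomes a storage increment."""
    row = f"{line['hour']}|{line['adid']}"  # the time-related key comes first
    store.increment(row, {"impressions": 1, f"age:{line['age']}": 1})


store = FakeCounterStore()
log = [
    {"hour": "2011-10-18T10", "adid": "ad1", "age": "25"},
    {"hour": "2011-10-18T10", "adid": "ad1", "age": "31"},
    {"hour": "2011-10-18T10", "adid": "ad2", "age": "25"},
]
for entry in log:
    process_line(store, entry)
```

In this sketch the number of storage calls equals the number of log lines, which illustrates why the cost of each increment matters so much at high input rates.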
Puma2: Pros and Cons
• Pros
▪ Puma2 code is very simple.
▪ Puma2 service is very easy to maintain.
• Cons
▪ “Increment” operation is expensive.
▪ Does not support complex aggregations.
▪ Hacky implementation of “most frequent elements”.
▪ Can cause a small number of data duplicates.
Improvements in Puma2
• Puma2
▪ Batching of requests. Didn't work well because of long-tail distribution.
• HBase
▪ “Increment” operation optimized by reducing locks.
▪ HBase region/HDFS file locality; short-circuited read.
▪ Reliability improvements under high load.
• Still not good enough!
Puma3 Architecture
• Puma3 is sharded by aggregation key.
• Each shard is a hashmap in memory.
• Each entry in the hashmap is a pair of an aggregation key and a user-defined aggregation.
• HBase as persistent key-value storage.
[Diagram: PTail → Puma3 → HBase; Serving]
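Sharding by aggregation key can be sketched as a stable hash of the key modulo the shard count. The function name `shard_for`, the choice of SHA-256, and the key layout are assumptions for illustration; the slides do not say how Puma3 assigns keys to shards.

```python
import hashlib
from typing import Dict, List, Tuple

Key = Tuple[str, ...]


def shard_for(key: Key, num_shards: int) -> int:
    """Map an aggregation key to a shard with a stable, process-independent hash."""
    digest = hashlib.sha256("|".join(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Each shard is an in-memory hashmap from aggregation key to aggregation state.
num_shards = 4
shards: List[Dict[Key, int]] = [{} for _ in range(num_shards)]

key = ("2011-10-18T10", "ad1", "25")
shard = shard_for(key, num_shards)
shards[shard][key] = shards[shard].get(key, 0) + 1
```

A stable hash means the same key always reaches the same shard, so each key's state lives in exactly one hashmap.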
Puma3 Architecture
• Write workflow
▪ For each log line, extract the columns for key and value.
▪ Look up in the hashmap and call user-defined aggregation
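The write workflow can be sketched in a few lines. The tab-separated log format, the `extract` and `write` helpers, and the use of a plain sum as the user-defined aggregation are assumptions made for this example.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]
# A user-defined aggregation folds one value into the running state.
Aggregation = Callable[[int, int], int]


def extract(line: str) -> Tuple[Key, int]:
    """Split a tab-separated log line into (key columns, value column)."""
    hour, adid, clicks = line.split("\t")
    return (hour, adid), int(clicks)


def write(table: Dict[Key, int], line: str, aggregate: Aggregation) -> None:
    """Look the key up in the in-memory hashmap and apply the aggregation."""
    key, value = extract(line)
    table[key] = aggregate(table.get(key, 0), value)


table: Dict[Key, int] = {}
for raw in ["10\tad1\t2", "10\tad1\t3", "11\tad1\t1"]:
    write(table, raw, lambda state, value: state + value)
```

No storage call happens on this path; the hashmap absorbs every update in memory.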
Puma3 Architecture
• Checkpoint workflow
▪ Every 5 min, save modified hashmap entries and the PTail checkpoint to HBase
▪ On startup (after node failure), load from HBase
▪ Get rid of items in memory once the time window has passed
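A rough sketch of the checkpoint, recovery, and eviction steps. The `Shard` class, the `agg:` key naming, and the plain dict standing in for HBase are hypothetical; the 5-minute timer is omitted and `checkpoint()` is simply called by hand.

```python
from typing import Dict, Set, Tuple

Key = Tuple[int, str]  # (hour, aggregation key); the time component comes first


class Shard:
    """Hypothetical Puma3-like shard; `store` stands in for HBase."""

    def __init__(self, store: Dict[str, int]) -> None:
        self.store = store
        self.table: Dict[Key, int] = {}
        self.dirty: Set[Key] = set()
        self.position = 0  # stand-in for the PTail checkpoint

    def add(self, hour: int, key: str, value: int) -> None:
        self.table[(hour, key)] = self.table.get((hour, key), 0) + value
        self.dirty.add((hour, key))
        self.position += 1

    def checkpoint(self) -> None:
        """Persist only the modified entries, plus the stream position."""
        for hour, key in self.dirty:
            self.store[f"agg:{hour}:{key}"] = self.table[(hour, key)]
        self.store["ptail_checkpoint"] = self.position
        self.dirty.clear()

    def recover(self) -> None:
        """On startup, reload aggregates and the position to resume from."""
        self.position = self.store.get("ptail_checkpoint", 0)
        for name, value in self.store.items():
            if name.startswith("agg:"):
                _, hour, key = name.split(":", 2)
                self.table[(int(hour), key)] = value

    def evict(self, oldest_open_hour: int) -> None:
        """Save pending changes, then drop entries whose window has passed."""
        self.checkpoint()
        self.table = {k: v for k, v in self.table.items() if k[0] >= oldest_open_hour}


hbase: Dict[str, int] = {}
shard = Shard(hbase)
shard.add(10, "ad1", 2)
shard.add(10, "ad1", 3)
shard.checkpoint()
shard.add(11, "ad1", 1)  # not yet checkpointed when the node "fails"

restarted = Shard(hbase)
restarted.recover()
```

After recovery the restarted shard holds the last saved aggregates and a position of 2, so replaying the stream from that position re-applies the third line exactly once.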
Puma3 Architecture
• Read workflow
▪ Read uncommitted: directly serve from the in-memory hashmap; load from HBase on miss.
▪ Read committed: read from HBase and serve.
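The two read modes can be sketched as two small functions. The dicts standing in for the in-memory hashmap and for HBase, and the key strings, are assumptions for illustration.

```python
from typing import Dict, Optional


def read_uncommitted(memory: Dict[str, int], store: Dict[str, int], key: str) -> Optional[int]:
    """Serve from the in-memory hashmap; fall back to storage on a miss."""
    if key in memory:
        return memory[key]
    return store.get(key)


def read_committed(store: Dict[str, int], key: str) -> Optional[int]:
    """Serve only what has already been checkpointed to storage."""
    return store.get(key)


store = {"10|ad1": 5, "09|ad1": 7}  # last checkpointed values
memory = {"10|ad1": 8}              # newer, not yet checkpointed; hour 09 was evicted

fresh = read_uncommitted(memory, store, "10|ad1")
durable = read_committed(store, "10|ad1")
older = read_uncommitted(memory, store, "09|ad1")
```

The trade-off in this sketch is freshness against stability: the uncommitted read sees updates since the last checkpoint, while the committed read only returns values that would survive a node failure.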
Puma3 Architecture
• Join
▪ Static join table in HBase.
▪ Distributed hash lookup in user-defined function (udf).
▪ Local cache improves the throughput of the udf a lot.
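A sketch of a lookup UDF with a local cache in front of a remote join table. `USER_AGE`, `fetch_age`, and the cache size are hypothetical; `functools.lru_cache` is used here only as a convenient stand-in for whatever cache the real UDFs use.

```python
from functools import lru_cache
from typing import Dict

# Stand-in for a static join table kept in HBase (hypothetical contents).
USER_AGE: Dict[str, int] = {"u1": 25, "u2": 31}
stats = {"remote_lookups": 0}


def fetch_age(userid: str) -> int:
    """Pretend remote lookup; counts calls so the cache effect is visible."""
    stats["remote_lookups"] += 1
    return USER_AGE.get(userid, -1)


@lru_cache(maxsize=100_000)
def age(userid: str) -> int:
    """UDF with a local cache in front of the remote join table."""
    return fetch_age(userid)


ages = [age(u) for u in ["u1", "u2", "u1", "u1", "u2"]]
```

Five UDF calls trigger only two remote lookups here. Caching like this is safe only because the join table is static.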
Puma2 / Puma3 comparison
• Puma3 is much better in write throughput
▪ Use 25% of the boxes to handle the same load.
▪ HBase is really good at write throughput.
• Puma3 needs a lot of memory
▪ Use 60GB of memory per box for the hashmap
▪ SSD can scale to 10x per box.
Puma3 Special Aggregations
• Unique Counts Calculation
▪ Adaptive sampling
▪ Bloom filter (in the plan)
• Most frequent item (in the plan)
▪ Lossy counting
▪ Probabilistic lossy counting
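Two of the techniques named above can be sketched compactly. These are textbook-style versions of adaptive sampling and lossy counting written for illustration; the function names, the SHA-256 hash, and the parameters are assumptions and do not describe Puma3's own code.

```python
import hashlib
from typing import Dict, Iterable, List, Set


def _hash64(item: str) -> int:
    return int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")


def estimate_uniques(items: Iterable[str], capacity: int = 1024) -> int:
    """Adaptive sampling: keep hashes whose low `depth` bits are all zero.

    When the sample overflows, raise `depth` by one, which discards about
    half of the sample. The estimate is len(sample) * 2**depth.
    """
    depth = 0
    sample: Set[int] = set()
    for item in items:
        h = _hash64(item)
        if (h & ((1 << depth) - 1)) == 0:
            sample.add(h)
            while len(sample) > capacity:
                depth += 1
                mask = (1 << depth) - 1
                sample = {x for x in sample if (x & mask) == 0}
    return len(sample) << depth


def lossy_count(items: Iterable[str], width: int) -> Dict[str, int]:
    """Lossy counting with bucket width `width` (roughly 1 / error bound).

    Each entry stores [count, delta], where delta bounds the occurrences
    that may have been missed before the entry was created.
    """
    entries: Dict[str, List[int]] = {}
    n = 0
    for item in items:
        n += 1
        bucket = (n + width - 1) // width
        if item in entries:
            entries[item][0] += 1
        else:
            entries[item] = [1, bucket - 1]
        if n % width == 0:  # bucket boundary: prune infrequent entries
            entries = {k: v for k, v in entries.items() if v[0] + v[1] > bucket}
    return {k: v[0] for k, v in entries.items()}


users = [f"user{i}" for i in range(50_000)]
approx_uniques = estimate_uniques(users)

stream = [x for i in range(50) for x in ("a", f"x{i}")]
frequent = lossy_count(stream, width=10)
```

Both functions use bounded memory: the sample never exceeds `capacity` hashes, and lossy counting drops rare items at every bucket boundary, so only the frequent item `a` survives in the example stream.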
PQL – Puma Query Language
• CREATE INPUT TABLE t ('time', 'adid', 'userid');
• CREATE VIEW v AS
    SELECT *, udf.age(userid)
    FROM t
    WHERE udf.age(userid) > 21
• CREATE HBASE TABLE h …
• CREATE LOGICAL TABLE l …
• CREATE AGGREGATION 'abc'
    INSERT INTO l (a, b, c)
    SELECT
      udf.hour(time),
      adid,
      age,
      count(1),
      udf.count_distinct(userid)
    FROM v
    GROUP BY
      udf.hour(time),
      adid,
      age;
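To make the semantics of the aggregation concrete, here is a plain-Python sketch of what it computes on a few rows. It is not how Puma executes PQL: the `AGES` table stands in for `udf.age`, the sample rows are invented, slicing the timestamp stands in for `udf.hour`, and an exact set replaces the distinct-count UDF.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for udf.age(userid).
AGES: Dict[str, int] = {"u1": 25, "u2": 31, "u3": 19}

rows: List[Tuple[str, str, str]] = [    # (time, adid, userid)
    ("2011-10-18 10:05", "ad1", "u1"),
    ("2011-10-18 10:40", "ad1", "u1"),
    ("2011-10-18 10:59", "ad1", "u2"),
    ("2011-10-18 11:02", "ad1", "u3"),  # filtered out: age is not > 21
]

GroupKey = Tuple[str, str, int]
counts: Dict[GroupKey, int] = defaultdict(int)
users: Dict[GroupKey, Set[str]] = defaultdict(set)

for time, adid, userid in rows:
    age = AGES[userid]
    if age <= 21:                       # WHERE udf.age(userid) > 21
        continue
    key = (time[:13], adid, age)        # GROUP BY udf.hour(time), adid, age
    counts[key] += 1                    # count(1)
    users[key].add(userid)              # exact stand-in for the distinct count

result = {k: (counts[k], len(users[k])) for k in counts}
```

Each group key starts with the hour, matching the earlier point that the first Group By key is always time-related.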
Future Works: challenges and opportunities
Future Works
• Scheduler Support
▪ Only simple scheduling is needed because the workload is continuous
• Mass adoption
▪ Migrate most daily reporting queries from Hive
• Open Source
▪ Biggest bottleneck: Java Thrift dependency
▪ Components will be open-sourced one by one
Similar Systems
• STREAM from Stanford
• Flume from Cloudera
• S4 from Yahoo
• Rainbird/Storm from Twitter
• Kafka from LinkedIn
Key differences
• Scalable Data Streams
▪ 9 GB/sec with < 10 sec of latency
▪ Both Push/RPC-based and Pull/File System-based
▪ Components to support arbitrary combination of channels
• Reliable Stream Aggregations
▪ Good support for Time-based Group By, Table-Stream Lookup Join
▪ Query Language: Puma : Realtime-MR = Hive : MR
▪ No support for sliding window, stream joins
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc. All rights reserved.