Post on 10-May-2015
Hadoop and Hive at Facebook Data and Applications
Your Company Logo Here
Wednesday, June 10, 2009 Santa Clara Marriott
Dhruba Borthakur, Ding Zhou
Who generates this data?
Lots of data is generated on Facebook » 200 million active users » 20 million users update their statuses at least
once each day » More than 850 million photos uploaded to the site
each month » More than 8 million videos uploaded each month » More than 1 billion pieces of content (web links,
news stories, blog posts, notes, photos, etc.) shared each week
http://www.slideshare.net/guest5b1607/text-analytics-summit-2009-roddy-lindsay-social-media-happiness-petabytes-and-lols
» Hadoop/Hive Warehouse › 4800 cores, 2 PetaBytes
total size
» Other Hadoop Clusters • HDFS-Scribe cluster: 320
cores, 160 TB total size • Hadoop Archival Cluster :
80 cores, 200TB total size • Test cluster : 800 cores,
150 TB total size
Where do we store parts of this data?
Data Collection using Scribe
Web Servers Scribe MidTier
Network Storage and Servers
Hadoop Hive Warehouse Oracle RAC MySQL
Data Collection using Scribe and HDFS
Web Servers
Scribe MidTier
RealBme Hadoop Cluster
Hadoop Hive Warehouse Oracle RAC
MySQL Hadoop Scribe Integration
Data Archive: Move old data to cheap storage
Cheap NAS
Hadoop Archival Cluster 20TB per node
Hadoop Archive Node
NFS
Hive Query
distcp
Hadoop Warehouse
HADOOP‐5048
Hive User Interfaces
Hive shell access
Hive Web UI
Data Analysis at Facebook
» Business Intelligence › Growth and monetization strategies › Product insights & decisions › Philosophy: build meta tools and provide easy access to data
» Artificial Intelligence › Recommendation & ranking products › Advertising optimization › Text analytics › Philosophy: model inference; data preparation; model building;
» Top-level site metrics
BI: Build centralized reporting tools
Bird-view of user growth by countries
Comparing certain metrics between user groups
» Example: “Find the number of status updates mentioning ‘swine flu’ per day last month”
» SELECT a.date, count(1) » FROM status_updates a » WHERE a.status LIKE “%swine flu%” » AND a.date >= ‘2009-05-01’ AND a.date <= ‘2009-05-31’ » GROUP BY a.date
BI: Make AdHoc reporting easy
Build site metric dashboard in a day
» Data collection: › Define metrics and log format (Hive schema) › Add logging to the site (Scribe logging) › Create a Hive table partitioned by date › Set up metric ETL cron job (Hive -> mysql/oracle)
» Data visualization (using mysql) » Data access (adhoc query using Hive)
Build Machine Learning Products on Hadoop/Hive
• Recommendation & ranking • Advertising optimization • Text analytics
What applications the user may like
» Recommend apps based on social and demographic popularity
» User-app log is huge » Joining user-app log with
user demographics is difficult
» Hive for data aggregation
Who the user wants to connect
» Take existing edges and user feedbacks as labels
» Build regression models based on user profile and local graph features
» Too many friends of friends » Model trained by sampling
» Hive for model inference » Hive for feature selection
» Market research & ad tool
» Extract popular words from user content
» Slice by age, gender, region » Sentiment analysis » Keyword association
» Hadoop used for text analytics
What users are talking about (Lexicon)
laid-off
Words associated with vodka
What ads the user might click on
» Predict user-ad click-through
» Ads click data is sparse so sampling can miss info
» Many ML algorithms are iterative thus not easy for hadoop
» Hadoop for model training
Build ensemble ML models on Hadoop
» Each mapper trains a number of models
» Each model output as a intermediate feature
» Model selection at reducer » A regression model is built
on selected features
ds1 ds2 ds3 ds4
ensembles
Train models locally Cross-Test models locally
Models assembled by ensemble methods Model inference in a second Hadoop job
In summary
» So Zuckerberg’s urgent questions are answered; » So celebrities know where their fans are from; » So we know one can like vodka and lemonade at the same time; » It’s fun playing with the data;
Dhruba Borthakur, Ding Zhou
» Hadoop and Hive at Facebook » Support product strategy and decision; » Recommendation & ranking products; » Advertising optimization; » Text analytics tools;
dhruba@, dzhou@