Hadoopsummitfb09 090611023401-phpapp02

Hadoop and Hive at Facebook Data and Applications


Wednesday, June 10, 2009 Santa Clara Marriott

Dhruba Borthakur, Ding Zhou

Who generates this data?

Lots of data is generated on Facebook
»  200 million active users
»  20 million users update their statuses at least once each day
»  More than 850 million photos uploaded to the site each month
»  More than 8 million videos uploaded each month
»  More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week

http://www.slideshare.net/guest5b1607/text-analytics-summit-2009-roddy-lindsay-social-media-happiness-petabytes-and-lols

Where do we store parts of this data?

»  Hadoop/Hive Warehouse
  ›  4800 cores, 2 PetaBytes total size
»  Other Hadoop Clusters
  •  HDFS-Scribe cluster: 320 cores, 160 TB total size
  •  Hadoop Archival Cluster: 80 cores, 200 TB total size
  •  Test cluster: 800 cores, 150 TB total size

Data Collection using Scribe

[Diagram components: Web Servers; Scribe MidTier; Network Storage and Servers; Hadoop Hive Warehouse; Oracle RAC; MySQL]

Data Collection using Scribe and HDFS

[Diagram components: Web Servers; Scribe MidTier; Realtime Hadoop Cluster; Hadoop Hive Warehouse; Oracle RAC; MySQL]

Hadoop Scribe Integration

Data Archive: Move old data to cheap storage

[Diagram components: Cheap NAS; Hadoop Archival Cluster (20 TB per node); Hadoop Archive Node; Hive Query; distcp; Hadoop Warehouse; HADOOP-5048]

Hive User Interfaces

Hive shell access

Hive Web UI

Data Analysis at Facebook

»  Business Intelligence
  ›  Growth and monetization strategies
  ›  Product insights & decisions
  ›  Philosophy: build meta tools and provide easy access to data

»  Artificial Intelligence
  ›  Recommendation & ranking products
  ›  Advertising optimization
  ›  Text analytics
  ›  Philosophy: model inference; data preparation; model building

BI: Build centralized reporting tools

»  Top-level site metrics

[Figures: bird's-eye view of user growth by country; comparison of certain metrics between user groups]

BI: Make AdHoc reporting easy

»  Example: "Find the number of status updates mentioning 'swine flu' per day last month"

    SELECT a.date, count(1)
    FROM status_updates a
    WHERE a.status LIKE '%swine flu%'
      AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
    GROUP BY a.date
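The ad hoc query on this slide can be tried on a laptop before running it against the warehouse. The sketch below is not Hive: it uses Python's built-in sqlite3 as a stand-in, with a handful of made-up rows. The table and column names come from the slide; the sample data and the added ORDER BY are assumptions for illustration. SQLite's LIKE is case-insensitive for ASCII text, which may behave differently from Hive's.

```python
import sqlite3

# SQLite stands in for Hive here; the rows below are invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status_updates (date TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO status_updates VALUES (?, ?)",
    [
        ("2009-04-30", "swine flu scare"),          # before the date range
        ("2009-05-01", "worried about swine flu"),
        ("2009-05-01", "is swine flu real?"),
        ("2009-05-01", "nice weather"),             # does not match the pattern
        ("2009-05-02", "got a swine flu shot"),
        ("2009-06-01", "swine flu again"),          # after the date range
    ],
)

# Same shape as the query on the slide, plus ORDER BY for stable output.
QUERY = """
SELECT a.date, count(1)
FROM status_updates a
WHERE a.status LIKE '%swine flu%'
  AND a.date >= '2009-05-01' AND a.date <= '2009-05-31'
GROUP BY a.date
ORDER BY a.date
"""

rows = conn.execute(QUERY).fetchall()
for day, mentions in rows:
    print(day, mentions)
```

Dates are stored as ISO strings, so the range filter works as a plain string comparison.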

Build site metric dashboard in a day

»  Data collection:
  ›  Define metrics and log format (Hive schema)
  ›  Add logging to the site (Scribe logging)
  ›  Create a Hive table partitioned by date
  ›  Set up metric ETL cron job (Hive -> MySQL/Oracle)

»  Data visualization (using MySQL)
»  Data access (ad hoc query using Hive)
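The ETL step in this workflow can be sketched roughly as follows. This is only an illustration under stated assumptions: the tab-separated log format, the event names, and the `metrics` table are hypothetical, the aggregation runs in plain Python rather than Hive, and SQLite stands in for the MySQL/Oracle reporting database.

```python
import sqlite3
from collections import Counter

# Hypothetical log lines in the form "<date>\t<user_id>\t<event>".
LOG_LINES = [
    "2009-06-09\t1\tphoto_upload",
    "2009-06-09\t2\tphoto_upload",
    "2009-06-09\t2\tstatus_update",
    "2009-06-10\t3\tphoto_upload",
]

def daily_counts(lines, event):
    """Count occurrences of one event per date (the date acts as the partition key)."""
    counts = Counter()
    for line in lines:
        ds, _user_id, ev = line.split("\t")
        if ev == event:
            counts[ds] += 1
    return sorted(counts.items())

def load_metric(conn, name, rows):
    """Write one row per (metric, date); re-running the job overwrites, not duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics "
        "(name TEXT, ds TEXT, value INTEGER, PRIMARY KEY (name, ds))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO metrics VALUES (?, ?, ?)",
        [(name, ds, value) for ds, value in rows],
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
load_metric(conn, "photo_upload", daily_counts(LOG_LINES, "photo_upload"))
# Running the same load twice leaves the table unchanged.
load_metric(conn, "photo_upload", daily_counts(LOG_LINES, "photo_upload"))

dashboard_rows = conn.execute(
    "SELECT ds, value FROM metrics WHERE name = ? ORDER BY ds", ("photo_upload",)
).fetchall()
print(dashboard_rows)
```

Keying the reporting table on metric name and date keeps a scheduled job safe to re-run after a failure.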

Build Machine Learning Products on Hadoop/Hive

•  Recommendation & ranking
•  Advertising optimization
•  Text analytics

What applications the user may like

»  Recommend apps based on social and demographic popularity
»  User-app log is huge
»  Joining user-app log with user demographics is difficult
»  Hive for data aggregation
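A toy version of the aggregation described on this slide might look like the sketch below. It is an assumption-laden illustration, not the production method: the data structures, the equal weighting of demographic and friend counts, and all names are hypothetical, and the join that Hive would perform at scale is done here with an in-memory dictionary lookup.

```python
from collections import Counter, defaultdict

def demographic_popularity(user_app_log, demographics):
    """Join the user-app log with demographics and count app usage per group."""
    by_group = defaultdict(Counter)
    for user, app in user_app_log:
        group = demographics.get(user)
        if group is not None:
            by_group[group][app] += 1
    return by_group

def recommend(user, user_app_log, demographics, friends, k=2):
    """Rank apps the user does not have by demographic plus friend popularity."""
    owned = {app for u, app in user_app_log if u == user}
    by_group = demographic_popularity(user_app_log, demographics)
    scores = Counter(by_group[demographics[user]])
    for u, app in user_app_log:
        if u in friends.get(user, ()):
            scores[app] += 1  # assumed weighting: one point per friend using the app
    ranked = sorted(
        ((score, app) for app, score in scores.items() if app not in owned),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [app for _, app in ranked[:k]]

# Invented sample data.
demographics = {
    "u1": ("18-24", "US"), "u2": ("18-24", "US"),
    "u3": ("18-24", "US"), "u4": ("35-44", "UK"),
}
user_app_log = [
    ("u1", "quiz"), ("u2", "chess"), ("u3", "chess"),
    ("u3", "quiz"), ("u4", "farm"),
]
friends = {"u1": {"u4"}}

suggestions = recommend("u1", user_app_log, demographics, friends)
print(suggestions)
```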

Who the user wants to connect with

»  Take existing edges and user feedback as labels
»  Build regression models based on user profile and local graph features
»  Too many friends of friends
»  Model trained by sampling
»  Hive for model inference
»  Hive for feature selection
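To make the "local graph features" and "sampling" points concrete, here is a small sketch of candidate generation and sampling. It is illustrative only: the mutual-friend count is an assumed example of a local graph feature, the graph and function names are hypothetical, and attaching labels and fitting the regression model are left out.

```python
import random
from collections import Counter

def fof_candidates(graph, user):
    """Friends of friends who are not yet friends, with their mutual-friend counts."""
    mutual = Counter()
    for friend in graph[user]:
        for fof in graph[friend]:
            if fof != user and fof not in graph[user]:
                mutual[fof] += 1
    return mutual

def sample_training_rows(graph, users, per_user, seed=0):
    """Keep at most `per_user` candidates per user so the training set stays small."""
    rng = random.Random(seed)
    rows = []
    for user in users:
        candidates = sorted(fof_candidates(graph, user).items())
        for cand, n_mutual in rng.sample(candidates, min(per_user, len(candidates))):
            rows.append((user, cand, n_mutual))
    return rows

# Invented undirected friendship graph as adjacency sets.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d", "e"},
    "d": {"b", "c"},
    "e": {"c"},
}

candidates_for_a = dict(fof_candidates(graph, "a"))
training_rows = sample_training_rows(graph, ["a", "e"], per_user=2)
print(candidates_for_a)
print(training_rows)
```

Sampling per user bounds the number of training rows even when a user has a very large number of friends of friends.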

What users are talking about (Lexicon)

»  Market research & ad tool
»  Extract popular words from user content
»  Slice by age, gender, region
»  Sentiment analysis
»  Keyword association
»  Hadoop used for text analytics
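The "extract popular words, then slice by age, gender, region" idea can be sketched in a few lines. This is a single-process illustration with invented posts, a hypothetical record layout, and a tiny assumed stopword list; the slides state that Hadoop is used for the real workload, and nothing here reflects how Lexicon is actually implemented.

```python
import re
from collections import Counter, defaultdict

# A tiny assumed stopword list, for illustration only.
STOPWORDS = frozenset({"the", "a", "i", "is", "to", "and", "more"})

def popular_words(posts, slice_key, top_n=3):
    """Count words per slice (e.g. region) and return the most common ones."""
    counts = defaultdict(Counter)
    for post in posts:
        # Keep hyphens and apostrophes so terms like "laid-off" stay whole.
        words = re.findall(r"[a-z'-]+", post["text"].lower())
        counts[post[slice_key]].update(w for w in words if w not in STOPWORDS)
    return {group: [w for w, _ in c.most_common(top_n)] for group, c in counts.items()}

# Invented sample posts.
posts = [
    {"text": "Vodka and lemonade tonight", "gender": "f", "region": "US"},
    {"text": "More vodka please", "gender": "m", "region": "US"},
    {"text": "Laid-off today", "gender": "m", "region": "UK"},
    {"text": "Laid-off again", "gender": "f", "region": "UK"},
]

top_by_region = popular_words(posts, "region", top_n=1)
print(top_by_region)
```

Changing `slice_key` to `"gender"` slices the same counts along a different dimension.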

laid-off

Words associated with vodka

What ads the user might click on

»  Predict user-ad click-through

»  Ad click data is sparse, so sampling can miss information

»  Many ML algorithms are iterative and thus not easy to run on Hadoop

»  Hadoop for model training

Build ensemble ML models on Hadoop

»  Each mapper trains a number of models
»  Each model's output is used as an intermediate feature
»  Model selection at reducer
»  A regression model is built on selected features

[Diagram labels: ds1, ds2, ds3, ds4; ensembles; train models locally; cross-test models locally; models assembled by ensemble methods; model inference in a second Hadoop job]
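The ensemble flow on this slide (train per data split, cross-test, select at the reducer, fit a regression on the selected outputs) can be imitated in one process. The sketch below makes several assumptions that are not in the slides: each "mapper" fits one univariate least-squares line per feature, the "reducer" ranks models by mean squared error on the splits they were not trained on, and the final regression is simplified to a one-variable fit on the mean of the selected models' outputs. All function names and the toy data are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Closed-form least squares for y ~ slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    return slope, my - slope * mx

def mapper(shard_id, rows, n_features):
    """Train one small model per feature using this shard only."""
    models = []
    for f in range(n_features):
        slope, intercept = fit_line([x[f] for x, _ in rows], [y for _, y in rows])
        models.append({"shard": shard_id, "feature": f,
                       "slope": slope, "intercept": intercept})
    return models

def apply_model(model, x):
    """A model's output, used downstream as an intermediate feature."""
    return model["slope"] * x[model["feature"]] + model["intercept"]

def reducer(models, shards, keep):
    """Cross-test each model on the other shards, keep the best, fit the final line."""
    def cross_error(model):
        rows = [r for sid, rs in shards.items() if sid != model["shard"] for r in rs]
        return mean((apply_model(model, x) - y) ** 2 for x, y in rows)
    selected = sorted(models, key=cross_error)[:keep]
    all_rows = [r for rs in shards.values() for r in rs]
    feats = [mean(apply_model(m, x) for m in selected) for x, _ in all_rows]
    slope, intercept = fit_line(feats, [y for _, y in all_rows])
    return selected, slope, intercept

def predict(ensemble, x):
    """Inference step, which the slide places in a second Hadoop job."""
    selected, slope, intercept = ensemble
    return slope * mean(apply_model(m, x) for m in selected) + intercept

# Toy data: y depends only on the first feature; four shards play ds1..ds4.
rows = [((i % 10, (i * 7) % 5), 3 * (i % 10) + 1) for i in range(40)]
shards = {s: rows[s::4] for s in range(4)}
models = [m for sid, rs in shards.items() for m in mapper(sid, rs, n_features=2)]
ensemble = reducer(models, shards, keep=2)
print(predict(ensemble, (10, 0)))
```

Because each model is trained on a single shard, the mapper stage needs no communication between shards; only the small fitted models travel to the reducer.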

In summary

»  So Zuckerberg's urgent questions are answered
»  So celebrities know where their fans are from
»  So we know one can like vodka and lemonade at the same time
»  It's fun playing with the data

Dhruba Borthakur, Ding Zhou

»  Hadoop and Hive at Facebook
»  Support product strategy and decision
»  Recommendation & ranking products
»  Advertising optimization
»  Text analytics tools

dhruba@, dzhou@
