Facebook Retrospective - Big Data World Europe 2012

Post on 11-Jun-2015

A retrospective on building, running and using the Hadoop stack at Facebook.

Data Infrastructure at Facebook: A retrospective

Joydeep Sen Sarma
Ex-Facebook DI Lead, Founder Qubole

Intro

• File/Database Systems developer (ex-NetApp/Oracle)
• Yahoo (2005-07), Facebook (2007-11)
• @Facebook:
– SysAdmin: operated massive Hadoop/Hive installs
– Architect: conceived/wrote Apache Hive, made HBase@FB happen
– Herded cats: first manager of Data Infra team
– IT engineer/DBA: built ETL tools, warehouse/reporting for FB Virtual Currency
– Vested my stock options!

• Founder Qubole Inc. (2011-)

What not to do: Yahoo

• Want to add ‘feed’ in warehouse? Fill form, schmooze PM, wait 2 months.

• Want to justify project? Take $100M, double count 5 times.

• Hard to find out what data exists in the company. Silos everywhere.

• Lots of grand architecture, but no progress

Goals going in

• Universal ability to log data and compute against it

• Build infrastructure for data processing
– Help people help themselves
– Get out of the way

• Done is better than perfect, Move Fast.
– Iterate, Fix Failures Fast, Do everything twice

• Sep, 2007:
– Use Case: compute relationship strength between friends
– Data Sets: user graph, interaction and page-view logs
– ~10TB cluster

…

• July, 2011:
– Ads reporting/data-mining, News Feed ranking, Spam classification, PYMK, Search Indexing, Entitization, Sentiment Analysis, Fraud Analysis ..
– ~10k queries a day, hundreds of users, scores concurrent
– 50PB cluster, 15 engineers/ops in total manning.

State of the Union

User Feedback

• Ex-Yahoo Senior Director, Ads Product Mgmt.: "I haven't done SQL for ages - but I can use this stuff easily"

• Ex-Yahoo Data Scientist: "This is so amazing. That all data is stored in one place and I can get access instantly without having to wait months and contact multiple groups/silos"

• Ex-PayPal Fraud Analyst: "So much better data and infrastructure than I have ever had in the past"

Key Highways

• Hive
– Centrally managed Hadoop service, no setup
– SQL is easy, add scripts for map-reduce
– Browser based query wizards for SQL dummies
  • Download results to Excel
  • Schedule queries periodically with a few clicks

• Scribe
– Just log data using Scribe from any application
– Dead simple to add attributes to user page views
– Easy to pull data from RDBMS
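To make "SQL is easy, add scripts for map-reduce" concrete: Hive can stream rows through an external script using its TRANSFORM ... USING clause, passing rows as tab-separated lines on stdin and reading tab-separated lines back from stdout. The sketch below is only an illustration of that pattern. The two-column input (user_id, url) and the function names are assumptions made for the example, not anything from the talk.

```python
import sys
from urllib.parse import urlsplit


def to_domain_rows(lines):
    """Map tab-separated (user_id, url) rows to (user_id, domain) rows.

    The column layout is hypothetical; a real script would match
    whatever columns the surrounding query selects.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows rather than failing the whole job
        user_id, url = fields
        domain = urlsplit(url).netloc.lower()
        yield f"{user_id}\t{domain}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream rows from stdin to stdout, one output line per valid input row."""
    for row in to_domain_rows(stdin):
        stdout.write(row + "\n")
```

When deployed as a streaming script, the file would end with a call to main(). The SQL side stays a plain SELECT that names the script in its USING clause, so analysts only drop to code for the step SQL cannot express.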

Key Highways

• Simple Workflow authoring system (Databee)

• Reporting is easy
– Provision MySQL Data-marts in hours
– Easy self-service charting/dashboarding software

• Data Explorer
– Wiki-like system for documenting tables, columns, types
– Keyword Search, find table authors, users
– Help people help people

Democracies – Ugh!

“Democracy may not be the perfect … but it is better than the alternatives.”

“The family that poops together stays together”

Maintaining Order

• Hadoop Fair Scheduler
– Guarantee resources to projects/users. Share excess capacity

• Multiple Compute tiers
– Production, Large Ad-hoc, Small Ad-hoc, Local-mode queries

• Kill the bad guys
– Code to hunt down bad queries/apps
– Track cpu/disk usage – go after biggies

• Ban assault rifles
– Basic ACLs – can’t delete important tables, directories
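The "guarantee resources, share excess capacity" idea can be sketched in a few lines. This is a simplified model, not the Hadoop Fair Scheduler's actual code: it ignores weights, preemption and task locality, and the pool names and numbers in the usage note are made up for illustration.

```python
def fair_share(capacity, pools):
    """Split `capacity` slots across pools.

    `pools` maps a pool name to (guaranteed_min, demand).
    Returns a dict of pool name -> allocated slots.
    """
    # Step 1: every pool gets its guarantee, capped by what it actually asks for.
    alloc = {name: min(guar, demand) for name, (guar, demand) in pools.items()}
    spare = capacity - sum(alloc.values())
    if spare < 0:
        raise ValueError("guarantees exceed cluster capacity")

    # Step 2: hand out spare slots one at a time to the pool that still
    # wants more and currently holds the fewest slots (ties broken by name).
    while spare > 0:
        hungry = [n for n, (_, demand) in pools.items() if alloc[n] < demand]
        if not hungry:
            break  # all demand is satisfied; leftover slots stay idle
        neediest = min(hungry, key=lambda n: (alloc[n], n))
        alloc[neediest] += 1
        spare -= 1
    return alloc
```

For example, with 100 slots and hypothetical pools `{"ads": (40, 70), "feed": (30, 10), "adhoc": (0, 50)}`, the feed pool only uses 10 of its guaranteed 30, and the slots it leaves unused are shared between the other two pools, which end up with 45 each.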

Why did we succeed?

DATA

DATA

All Hail Data Consolidation

(9pm, FB Hack Night)
Ads Engineering Director: “Hey Joy, I want to join user fb-currency purchases with friend request data to test a thesis – pointers?”

Hadoop

• Cheap
– Can consolidate everything.
– We made it cheaper (RCFile, HDFS-RAID)

• Reduces governance cost
– Only worry about really really large stuff.
– Fewer data replication processes to manage

• Separates compute from storage
– Most legacy vendors don’t get this

• Disk-based analytic systems degrade gracefully
– No tipping point (vs. in-memory only)
– Ability to catch up, go back in the past (vs. real-time stream processing only)

Things we missed

• SLOOOOOOW
– Extensive work on FB Hadoop repo for faster scheduling
– Make testing faster (approx. queries)
– Watch @Qubole

• SQL as rope
– Need higher level templates. Don’t need 10 versions of a 30-day moving average calculator

• Duplication of queries/jobs
– How to discover if there’s existing summaries?
– People help people, but still ..

• Didn’t build enough APIs
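One way to read the "higher level templates" point: instead of ten hand-written variants of the same moving-average query, keep a single parameterized generator. The sketch below is an assumption about what such a template could look like, not a tool described in the talk. The function name, the default alias and the example table are hypothetical, and the windowed AVG assumes exactly one row per day and an engine that supports SQL window functions.

```python
def moving_average_sql(table, value_col, date_col, days=30):
    """Render a trailing N-day moving-average query as a SQL string."""
    # Accept only plain identifiers so the template cannot be abused
    # to inject arbitrary SQL.
    for name in (table, value_col, date_col):
        if not name.replace("_", "").isalnum():
            raise ValueError(f"not a plain identifier: {name!r}")
    if days < 1:
        raise ValueError("days must be at least 1")
    # With one row per day, the current row plus (days - 1) preceding
    # rows cover a window of `days` days.
    return (
        f"SELECT {date_col}, "
        f"AVG({value_col}) OVER (ORDER BY {date_col} "
        f"ROWS BETWEEN {days - 1} PRECEDING AND CURRENT ROW) "
        f"AS {value_col}_avg_{days}d "
        f"FROM {table}"
    )
```

Calling `moving_average_sql("daily_revenue", "revenue", "ds")` yields a 30-day query, and a 7-day version is the same call with `days=7`, so there is one definition to review and fix instead of many copies.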

Final Words

• It’s not the software, stupid
– Software is easy to write and fix
– Can be slow

• It’s the service that matters
– Making everything work seamlessly
– Ability to fix/improve things FAST

Q&A