WaterlooHiveTalk

Petabyte Scale Data Warehousing at Facebook
Ning Zhang, Data Infrastructure, Facebook

Transcript of WaterlooHiveTalk

Page 1: WaterlooHiveTalk

Petabyte Scale Data Warehousing at Facebook

Ning Zhang

Data Infrastructure, Facebook

Page 2: WaterlooHiveTalk

Overview

Motivations
– Data-driven model
– Challenges

Data Infrastructure
– Hadoop & Hive
– In-house tools

Hive Details
– Architecture
– Data model
– Query language
– Extensibility

Research Problems

Page 3: WaterlooHiveTalk

Motivations

Page 4: WaterlooHiveTalk

Facebook is just a Set of Web Services …

Page 5: WaterlooHiveTalk

… at Large Scale

The social graph is large
– 400 million monthly active users
– 250 million daily active users
– 160 million active objects (groups/events/pages)
– 130 friend connections per user on average
– 60 object (groups/events/pages) connections per user on average

Activities on the social graph
– People spent 500 billion minutes per month on FB
– Average user creates 70 pieces of content each month
– 25 billion pieces of content are shared each month
– Millions of search queries per day

Facebook is still growing fast
– New users, features, services …

Page 6: WaterlooHiveTalk

Facebook is still growing and changing

[Chart: Timeline of Monthly Active Users (MAU)]

Page 7: WaterlooHiveTalk

Under the Hood

Data flow from users' perspective
– Clients (browser/phone/3rd party apps) → Web Services → Users
– Another big topic on the Web Services

To complete the feedback system …
– The developers want to know how a new app/feature is received by the users (A/B test)
– The advertisers want to know how their ads perform (dashboard/reports)
– Based on historical data, how to construct a model and predict the future (machine learning)

Need data analytics!
– Data warehouse: ETL, data processing, BI …
– Closing the loop: decision-making based on analyzing the data (users' feedback)

Page 8: WaterlooHiveTalk

Data-driven Business/R&D/Science …

DSS is not new, but the Web gives it new elements.

"In 2009, more data will be generated by individuals than in the entire history of mankind through 2008."
-- Andreas Weigend, Harvard Business Review

"The center of the universe has shifted from e-business to me-business."
-- same as above

"Invariably, simple models and a lot of data trump more elaborate models based on less data."
-- Alon Halevy, Peter Norvig and Fernando Pereira, The Unreasonable Effectiveness of Data

Page 9: WaterlooHiveTalk

Problems and Challenges

Data-driven development/business
– Huge amounts of log data/user data generated every day
– Need to analyze these data to feed back into development/business decisions
– Machine learning, report/dashboard generation, A/B testing

And many more problems
– Scalability (more than petabytes)
– Availability (HA)
– Manageability (e.g., scheduling)
– Performance (CPU, memory, disk/network I/O)
– And many more …

Page 10: WaterlooHiveTalk

Facebook Engineering Teams (backend)

Facebook Infrastructure
– Building foundations that serve end users/applications
– OLTP workload
– Components include MySQL, memcached, HipHop (PHP), Thrift, Cassandra, Haystack, flashcache, …

Facebook Data Infrastructure (data warehouse)
– Building systems that serve data analysts, research scientists, engineers, product managers, executives, etc.
– OLAP workload
– Components include Hadoop, Hive, HDFS, Scribe, HBase, and tools (ETL, UI, workflow management, etc.)

Other Engineering teams
– Platform, search, site integrity, monetization, apps, growth, etc.

Page 11: WaterlooHiveTalk

DI Key Challenges (I) – scalability

Data, data and more data
– 200 GB/day in March 2008 → 12 TB/day at the end of 2009
– About 8x increase per year
– Total size is 5 PB now (x3 when considering replication)
– Same order as the Web (~25 billion indexable pages)

Page 12: WaterlooHiveTalk

DI Key Challenges (II) – Performance

Queries, queries and more queries
– More than 200 unique users query the data warehouse every day
– 7K queries/day at the end of 2009
– 25K queries/day now
– Workload is a mixture of ad-hoc queries and ETL/reporting queries

Fast, faster and real-time
– Users expect faster response time on fresher data (e.g., fighting spam/fraud in near real-time)
– Sampling a subset of the data is not always good enough

Page 13: WaterlooHiveTalk

Other Requirements

Accessibility
– Everyone should be able to log & access data easily, not only engineers (a lot of our users do not have CS degrees!)
– Schema discovery (more than 20K tables)
– Data exploration and visualization (learning the data by looking)
– Leverage existing prevalent and familiar tools (e.g., BI tools)

Flexibility
– Schema changes frequently (adding new columns, changing column types, different partitions of tables, etc.)
– Data formats can differ (plain text, row store, column store, complex data types)

Extensibility
– Easy to plug in user-defined functions, aggregations, etc.
– Data storage could be files, web services, "NoSQL stores" …

Page 14: WaterlooHiveTalk

Why not Existing Data Warehousing Systems?

Cost of analysis and storage on proprietary systems does not support the trend towards more data
– Cost based on data size (15 PB costs a lot!)
– Expensive hardware and support

Limited scalability does not support the trend towards more data
– Products designed decades ago (not suitable for a petabyte-scale DW)
– ETL is a big bottleneck

Long product development & release cycles
– User requirements change frequently (agile programming practice)

Closed and proprietary systems

Page 15: WaterlooHiveTalk

Let's try Hadoop (MapReduce + HDFS) …

Pros
– Superior in availability/scalability/manageability (99.9%)
– Large and healthy open source community (popular in both industry and academic organizations)

Page 16: WaterlooHiveTalk

But not quite …

Cons: programmability and metadata
– Efficiency is not that great, but you can throw more hardware at it
– MapReduce is hard to program (users know SQL/bash/Python) and hard to debug, so it takes longer to get results
– No schema

Solution: Hive!

Page 17: WaterlooHiveTalk

What is Hive ?

A system for managing and querying structured data built on top of Hadoop
– MapReduce for execution
– HDFS for storage
– RDBMS for metadata

Key building principles:
– SQL is a familiar language on data warehouses
– Extensibility: types, functions, formats, scripts (connecting to HBase, Pig, Hypertable, Cassandra, etc.)
– Scalability and performance
– Interoperability (JDBC/ODBC/Thrift)

Page 18: WaterlooHiveTalk

Hive: Familiar Schema Concepts

Name        Example                        HDFS Directory
Table       pvs                            /wh/pvs
Partition   ds = 20090801, ctry = US       /wh/pvs/ds=20090801/ctry=US
Bucket      user into 32 buckets           /wh/pvs/ds=20090801/ctry=US/part-00000 (HDFS file for user hash 0)
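For concreteness, a minimal HiveQL sketch of how a table like the one above could be declared with the partitions and buckets shown; the column names (userid, page_url) and the delimited text format are assumptions, not taken from the talk:

-- hypothetical DDL for the pvs example table
CREATE TABLE pvs (
  userid   INT,      -- assumed column used as the bucketing key ("user")
  page_url STRING    -- assumed column
)
PARTITIONED BY (ds STRING, ctry STRING)         -- date and country partitions as in the example
CLUSTERED BY (userid) INTO 32 BUCKETS           -- "user into 32 buckets"
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;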

Page 19: WaterlooHiveTalk

Column Data Types

• Primitive Types
  • integer types, float, string, date, boolean
• Nest-able Collections
  • array<any-type>
  • map<primitive-type, any-type>
• User-defined types
  • structures with attributes, which can be of any type
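A small sketch of these column types in a table declaration; the table and column names are illustrative assumptions:

CREATE TABLE type_demo (
  id       INT,                                -- primitive type
  tags     ARRAY<STRING>,                      -- nest-able collection
  counters MAP<STRING, INT>,                   -- map of primitive to any-type
  address  STRUCT<city:STRING, zip:STRING>     -- structure with typed attributes
);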

Page 20: WaterlooHiveTalk

Hive Query Language

DDL
– {create/alter/drop} {table/view/partition}
– create table as select

DML
– Insert overwrite

QL
– Sub-queries in the FROM clause
– Equi-joins (including outer joins)
– Multi-table insert (see the sketch after this list)
– Sampling
– Lateral views

Interfaces
– JDBC/ODBC/Thrift
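As a rough illustration of the multi-table insert feature listed above, one scan of a source table can feed several destination tables; all table and column names below are assumptions reusing the earlier pvs example:

-- one pass over pvs populates two tables (pv_us and pv_ca are hypothetical)
FROM pvs
INSERT OVERWRITE TABLE pv_us SELECT userid, page_url WHERE ctry = 'US'
INSERT OVERWRITE TABLE pv_ca SELECT userid, page_url WHERE ctry = 'CA';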

Page 21: WaterlooHiveTalk

Optimizations

Column Pruning
– Also pushed down to the scan in columnar storage (RCFILE)

Predicate Pushdown
– Not pushed below non-deterministic functions (e.g. rand())

Partition Pruning (example below)

Sample Pruning

Handling small files
– Merge while writing
– CombineHiveInputFormat while reading

Small Jobs
– SELECT * with partition predicates runs in the client

Restartability (Work In Progress)
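A rough example of partition pruning on the pvs table used earlier (a sketch; the selected columns are assumptions): only one date/country partition is read instead of the whole table.

-- the WHERE clause matches the partition columns, so Hive scans only
-- /wh/pvs/ds=20090801/ctry=US rather than every partition
SELECT page_url, count(1)
FROM pvs
WHERE ds = '20090801' AND ctry = 'US'
GROUP BY page_url;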

Page 22: WaterlooHiveTalk

Hive: Simplifying Hadoop Programming

$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar \
    -input /user/hive/warehouse/kv1 \
    -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh \
    -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*

vs.

hive> select key, count(1) from kv1 where key > 100 group by key;

Page 23: WaterlooHiveTalk

MapReduce Scripts Examples

add file page_url_to_id.py;
add file my_python_session_cutter.py;

FROM (
  SELECT TRANSFORM(uhash, page_url, unix_time)
         USING 'page_url_to_id.py'
         AS (uhash, page_id, unix_time)
  FROM mylog
  DISTRIBUTE BY uhash
  SORT BY uhash, unix_time
) mylog2
SELECT TRANSFORM(uhash, page_id, unix_time)
       USING 'my_python_session_cutter.py'
       AS (uhash, session_info);

Page 24: WaterlooHiveTalk

Hive Architecture

Page 25: WaterlooHiveTalk

Hive: Making Optimizations Transparent

Joins:
– Joins try to reduce the number of map/reduce jobs needed
– Memory-efficient joins by streaming the largest tables
– Map Joins (sketch below): user-specified small tables are stored in hash tables on the mapper, so no reducer is needed

Aggregations:
– Map-side partial aggregations
  Hash-based aggregates
  Serialized key/values in hash tables
  90% speed improvement on the query SELECT count(1) FROM t;
– Load balancing for data skew
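A hedged sketch of the map-join hint mentioned above; the fact table, dimension table, and join columns are all assumptions:

-- the hint asks Hive to load the small table d into an in-memory hash table
-- on every mapper, so the join completes without a reduce phase
SELECT /*+ MAPJOIN(d) */ f.userid, d.country_name
FROM pvs f
JOIN dim_country d ON (f.ctry = d.country_code);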

Page 26: WaterlooHiveTalk

Hive: Making Optimizations Transparent

Storage:
– Column-oriented data formats (sketch below)
– Column and partition pruning to reduce scanned data
– Lazy de-serialization of data

Plan Execution:
– Parallel execution of parts of the plan
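A tiny sketch of choosing a column-oriented on-disk format per table (RCFILE, mentioned on the optimizations slide); the table and column names are assumptions:

-- storing a table in the columnar RCFile format lets column pruning skip
-- unread columns at the storage layer
CREATE TABLE pvs_rcfile (userid INT, page_url STRING)
PARTITIONED BY (ds STRING, ctry STRING)
STORED AS RCFILE;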

Page 27: WaterlooHiveTalk

Hive: Open & Extensible

Different on-disk storage (file) formats
– Text File, Sequence File, …

Different serialization formats and data types
– LazySimpleSerDe, ThriftSerDe, …

User-provided map/reduce scripts
– In any language; use stdin/stdout to transfer data

User-defined Functions (registration sketch below)
– substr, trim, from_unixtime, …

User-defined Aggregation Functions
– sum, average, …

User-defined Table Functions
– explode, …
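A minimal sketch of how a custom UDF is typically plugged in from the Hive CLI; the jar path, function name, and implementing class are hypothetical:

-- register a user-defined function packaged in a jar (names are assumptions)
ADD JAR /tmp/my_udfs.jar;
CREATE TEMPORARY FUNCTION my_normalize AS 'com.example.hive.udf.Normalize';
SELECT my_normalize(page_url) FROM pvs LIMIT 10;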

Page 28: WaterlooHiveTalk

Hive: Interoperability with Other Tools

JDBC
– Enables integration with JDBC-based SQL clients

ODBC
– Enables integration with Microstrategy

Thrift
– Enables writing cross-language clients
– Main form of integration with the PHP-based Web UI

Page 29: WaterlooHiveTalk

Powered by Hive

Page 30: WaterlooHiveTalk

Usage in Facebook

Page 31: WaterlooHiveTalk

Usage

Types of Applications:
– Reporting
  E.g. daily/weekly aggregations of impression/click counts
  Measures of user engagement
  Microstrategy reports
– Ad hoc analysis
  E.g. how many group admins, broken down by state/country
– Machine learning (assembling training data)
  Ad optimization
  E.g. user engagement as a function of user attributes
– Many others

Page 32: WaterlooHiveTalk

Hadoop & Hive Cluster @ Facebook

Hadoop/Hive cluster
– 13,600 cores
– Raw storage capacity ~ 17 PB
– 8 cores + 12 TB per node
– 32 GB RAM per node
– Two-level network topology
  1 Gbit/sec from node to rack switch
  4 Gbit/sec to top-level rack switch

2 clusters
– One for ad-hoc users
– One for strict-SLA jobs

Page 33: WaterlooHiveTalk

Hive & Hadoop Usage @ Facebook

Statistics per day:
– 800 TB of I/O per day
– 10K – 25K Hive jobs per day

Hive simplifies Hadoop:
– New engineers go through a Hive training session
– Analysts (non-engineers) use Hadoop through Hive
– Most jobs are Hive jobs

Page 34: WaterlooHiveTalk

Data Flow Architecture at Facebook

[Data flow diagram: Web Servers log via Scribe (Scribe-HDFS / Scribe-Hadoop Cluster); data from the Scribe cluster and Federated MySQL is loaded into the Production Hive-Hadoop Cluster, which also connects to Oracle RAC; Hive replication copies data to the Adhoc Hive-Hadoop Cluster]

Page 35: WaterlooHiveTalk

Scribe-HDFS: 101

[Diagram: multiple Scribed daemons send <category, msgs> to HDFS Data Nodes in the Scribe-HDFS cluster, appending to /staging/<category>/<file>]

Page 36: WaterlooHiveTalk

Scribe-HDFS: Near real time Hadoop

Clusters collocated with the web servers

Network is the biggest bottleneck

Typical cluster has about 50 nodes.

Stats:
– 50 TB/day of raw data logged
– 99% of the time data is available within 20 seconds

Page 37: WaterlooHiveTalk

Warehousing at Facebook

Instrumentation (PHP/Python etc.)

Automatic ETL
– Continuous copying of data into Hive tables

Metadata Discovery (CoHive)

Query (Hive)

Workflow specification and execution (Chronos)

Reporting tools

Monitoring and alerting

Page 38: WaterlooHiveTalk

Future Work

Scaling in a Dynamic and Fast Growing Environment
– Erasure codes for Hadoop
– Namenode scalability past 150 million objects

Isolating ad-hoc queries from jobs with strict deadlines
– Hive Replication

Resource Sharing
– Pools for slots

More scalable loading of data
– Incremental load of site data
– Continuous load of log data

Page 39: WaterlooHiveTalk

Future Work

Discovering Data from > 20K tables
– Collaborative Hive

Finding Unused/rarely used Data

Page 40: WaterlooHiveTalk

Future

– Dynamic inserts into multiple partitions (sketch below)
– More join optimizations
– Persistent UDFs, UDAFs and UDTFs
– Benchmarks for monitoring performance
– IN, EXISTS and correlated sub-queries
– Statistics
– Materialized Views
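As a rough illustration of the dynamic-partition insert item above (the configuration flags are standard Hive settings; the table and column names reuse the earlier hypothetical pvs example, and pvs_staging is invented for the sketch):

-- dynamic partitioning must be enabled before such an insert
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- partitions (ds, ctry) are created on the fly from the selected values
INSERT OVERWRITE TABLE pvs PARTITION (ds, ctry)
SELECT userid, page_url, ds, ctry FROM pvs_staging;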

Page 41: WaterlooHiveTalk

Research Challenges

Reducing response time for small/medium jobs
– 20 thousand queries per day → 1 million queries per day
– Indexes on Hadoop, data mart strategy
– Near real-time query processing: pipelining MapReduce

Distributed systems problems at large scale:
– Job scheduling: mixed throughput and response-time workloads
– Orchestrating commits on thousands of machines (scribe conf files)
– Cross data center replication and consistency

Full SQL compliance
– Required by 3rd party tools (e.g., BI) through ODBC/JDBC

Page 42: WaterlooHiveTalk

Query Optimizations

Efficiently compute histograms, median, distinct values in a distributed shared-nothing architecture

Cost models in the MapReduce framework

Page 43: WaterlooHiveTalk

Social Graph

Every user sees a different, personalized stream of information (news feed)
– 130 friend + 60 object updates in real time
– EdgeRank: ranking of the updates that should be shown at the top

The social graph is stored in distributed MySQL databases
– Data replication between data centers: an update to one data center should be replicated to the other data centers as well
– How to partition a dense graph such that data transfer between partitions is minimized

Page 44: WaterlooHiveTalk

Questions?