July 2012 HUG: Building Data Pipelines on Hadoop

Sameer Raheja Director Engineering, Yahoo! July 18, 2012 Data Pipeline Overview

description

This talk reviews the components required to build large-scale data pipelines on Hadoop, drawing on the experience of building such pipelines at Yahoo.

Transcript of July 2012 HUG: Building Data Pipelines on Hadoop

Page 1: July 2012 HUG: Building Data Pipelines on Hadoop

Sameer Raheja Director Engineering, Yahoo!

July 18, 2012

Data Pipeline Overview

Page 2: July 2012 HUG: Building Data Pipelines on Hadoop

Data Pipeline Overview

•  What is a Data Pipeline?
•  What components are required for Data Pipelines?
•  How Hadoop is used to solve the Data Pipeline challenge at Yahoo

Page 3: July 2012 HUG: Building Data Pipelines on Hadoop

•  Wikipedia defines Pipeline Computing as

–  “A set of data processing elements connected in series, so that the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion; in that case, some amount of buffer storage is often inserted between elements.”

– http://en.wikipedia.org/wiki/Pipeline_%28computing%29
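
The definition above can be illustrated with a minimal sketch in which each element is a Python generator and the output of one element is the input of the next. The record layout ("user,event" lines) and the function names are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator


def parse(lines: Iterable[str]) -> Iterator[dict]:
    """First element: turn raw 'user,event' lines into records."""
    for line in lines:
        user, event = line.strip().split(",")
        yield {"user": user, "event": event}


def keep_clicks(records: Iterable[dict]) -> Iterator[dict]:
    """Second element: pass through only click events."""
    for record in records:
        if record["event"] == "click":
            yield record


def count_by_user(records: Iterable[dict]) -> dict:
    """Final element: aggregate the surviving records per user."""
    counts: dict = {}
    for record in records:
        counts[record["user"]] = counts.get(record["user"], 0) + 1
    return counts


raw = ["alice,click", "bob,serve", "alice,click", "bob,click"]
# The elements are connected in series: parse -> keep_clicks -> count_by_user.
result = count_by_user(keep_clicks(parse(raw)))
print(result)
```

Because generators are lazy, each record flows through all elements before the next one is read, which loosely mirrors the time-sliced execution mentioned in the definition.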

Page 4: July 2012 HUG: Building Data Pipelines on Hadoop

What makes a Data Pipeline complex?

Page 5: July 2012 HUG: Building Data Pipelines on Hadoop

Data Volume

Page 6: July 2012 HUG: Building Data Pipelines on Hadoop

Time

Page 7: July 2012 HUG: Building Data Pipelines on Hadoop

Frequency

Page 8: July 2012 HUG: Building Data Pipelines on Hadoop

Parallelism

Page 9: July 2012 HUG: Building Data Pipelines on Hadoop

Catch Up

Page 10: July 2012 HUG: Building Data Pipelines on Hadoop

Reprocessing

Page 11: July 2012 HUG: Building Data Pipelines on Hadoop

Coordination

Page 12: July 2012 HUG: Building Data Pipelines on Hadoop

A few more

•  Data Policy
•  Capacity Planning
•  Monitoring
•  Alerting
•  Retries

Page 13: July 2012 HUG: Building Data Pipelines on Hadoop

Physical Representation of a Pipeline

Page 14: July 2012 HUG: Building Data Pipelines on Hadoop

Definition: Stage

filter_serve(5m)

filter_click(15m)

click_serve_join(15m)

click_by_demo(hourly)

<Stage name="filter_click">
  <Schedule frequency="15m" offset="0" timezone="UTC"/>
  ...
</Stage>

<Stage name="click_serve_join">
  <Schedule frequency="15m" offset="0" timezone="UTC"/>
  ...
</Stage>

<Stage name="click_by_demo">
  <Schedule frequency="hourly" offset="0" timezone="UTC"/>
  ...
</Stage>

<Stage name="filter_serve">
  <Schedule frequency="5m" offset="0" timezone="UTC"/>
  ...
</Stage>
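
As a rough illustration of what a Schedule element implies, the sketch below expands a frequency and offset into nominal instance start times within a window. Reading offset as minutes past midnight UTC is an assumption, as are the names FREQUENCY_MINUTES and stage_instances; the slide does not define either.

```python
from datetime import datetime, timedelta, timezone

# Period of each frequency value used on the slide, in minutes.
FREQUENCY_MINUTES = {"5m": 5, "15m": 15, "hourly": 60}


def stage_instances(frequency, offset_minutes, window_start, window_end):
    """Return nominal start times in [window_start, window_end) for one stage.

    Assumption: instances are aligned to midnight UTC plus offset_minutes.
    """
    period = timedelta(minutes=FREQUENCY_MINUTES[frequency])
    midnight = window_start.replace(hour=0, minute=0, second=0, microsecond=0)
    first = midnight + timedelta(minutes=offset_minutes)
    if first < window_start:
        # Ceiling division: number of periods needed to reach the window.
        steps = -(-(window_start - first) // period)
        first += steps * period
    times = []
    current = first
    while current < window_end:
        times.append(current)
        current += period
    return times


window_start = datetime(2012, 7, 18, 5, 0, tzinfo=timezone.utc)
window_end = window_start + timedelta(hours=1)
for name, freq in [("filter_serve", "5m"), ("filter_click", "15m"),
                   ("click_serve_join", "15m"), ("click_by_demo", "hourly")]:
    print(name, len(stage_instances(freq, 0, window_start, window_end)))
```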

Page 15: July 2012 HUG: Building Data Pipelines on Hadoop

Definition: Stage Dependencies

filter_serve(5m)

filter_click(15m)

click_serve_join(15m)

click_by_demo(hourly)

<Stage name="filter_serve">
  <Schedule frequency="5m" offset="0" timezone="UTC"/>
  ...
</Stage>

<Stage name="filter_click">
  <Schedule frequency="15m" offset="0" timezone="UTC"/>
  ...
</Stage>

<Stage name="click_serve_join">
  <Dependencies>
    <DependsOn stageName="filter_clicks"/>
    <DependsOn stageName="filter_serves" start="$stageStartTime - 2" end="$stageStartTime"/>
  </Dependencies>
  ...
</Stage>

<Stage name="click_by_demo">
  <Dependencies>
    <DependsOn stageName="click_serve_join" start="$stageStartTime - 3" end="$stageStartTime"/>
  </Dependencies>
  ...
</Stage>
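
The start and end attributes on DependsOn can be read as a window of upstream instances. The sketch below assumes that the offsets count upstream periods relative to the dependent instance's start time, so that "$stageStartTime - 2" through "$stageStartTime" selects three 5-minute instances for one 15-minute instance. That interpretation and the function name upstream_instances are assumptions; the slide does not spell out the semantics.

```python
from datetime import datetime, timedelta


def upstream_instances(stage_start, upstream_period_min,
                       start_offset=0, end_offset=0):
    """Nominal times of the upstream instances one stage instance waits for.

    Assumption: offsets count upstream periods relative to stage_start,
    so start_offset=-2, end_offset=0 selects three consecutive instances.
    """
    period = timedelta(minutes=upstream_period_min)
    return [stage_start + k * period
            for k in range(start_offset, end_offset + 1)]


# A 15-minute join instance waiting on three 5-minute serve instances.
join_start = datetime(2012, 7, 18, 5, 15)
serves = upstream_instances(join_start, 5, start_offset=-2)
print([t.strftime("%H:%M") for t in serves])

# An hourly instance waiting on four 15-minute join instances.
hourly_start = datetime(2012, 7, 18, 6, 0)
joins = upstream_instances(hourly_start, 15, start_offset=-3)
print([t.strftime("%H:%M") for t in joins])
```

Under this reading the window sizes line up with the frequencies: three 5-minute instances cover 15 minutes, and four 15-minute instances cover one hour.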

Page 16: July 2012 HUG: Building Data Pipelines on Hadoop

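Definition: Feed Dependencies

[Diagram: filter_serve(5m) writes feeds bsrv_5m and gsrv_15m; filter_click(15m) writes bclk_15m and gclk_15m; click_serve_join(15m) reads those feeds and writes aclk_15m; click_by_demo(hourly) reads aclk_15m and writes cdem_hourly.]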

<Stage name="filter_serve">
  <Data>
    <Outputs>
      <Output feedID="gsrv"/>
      <Output feedID="bsrv"/>
    </Outputs>
  </Data>
  ...
</Stage>

<Stage name="filter_click">
  <Data>
    <Outputs>
      <Output feedID="gclk"/>
      <Output feedID="bclk"/>
    </Outputs>
  </Data>
  ...
</Stage>

<Stage name="click_serve_join">
  <Data>
    <Inputs>
      <Input feedID="bclk"/>
      <Input feedID="gclk"/>
      <Input feedID="gsrv"/>
      <Input feedID="bsrv"/>
    </Inputs>
    <Outputs>
      <Output feedID="aclk"/>
    </Outputs>
  </Data>
  ...
</Stage>

<Stage name="click_by_demo">
  <Data>
    <Inputs>
      <Input feedID="aclk"/>
    </Inputs>
    <Outputs>
      <Output feedID="cdem"/>
    </Outputs>
  </Data>
  ...
</Stage>
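
Feed declarations like these are enough to derive the stage ordering without listing stage dependencies by hand. The following sketch, which assumes each feed has exactly one producing stage, maps every input feed back to its producer and sorts the stages topologically. The dictionary layout and the stage_graph helper are illustrative, not part of the system described in the talk.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Inputs and outputs per stage, copied from the feed definitions above.
STAGES = {
    "filter_serve": {"inputs": [], "outputs": ["gsrv", "bsrv"]},
    "filter_click": {"inputs": [], "outputs": ["gclk", "bclk"]},
    "click_serve_join": {"inputs": ["bclk", "gclk", "gsrv", "bsrv"],
                         "outputs": ["aclk"]},
    "click_by_demo": {"inputs": ["aclk"], "outputs": ["cdem"]},
}


def stage_graph(stages):
    """Map each stage to the set of stages that produce its input feeds."""
    producer = {feed: name
                for name, spec in stages.items()
                for feed in spec["outputs"]}
    return {name: {producer[feed] for feed in spec["inputs"] if feed in producer}
            for name, spec in stages.items()}


graph = stage_graph(STAGES)
# static_order() yields every stage after all of its upstream stages.
order = list(TopologicalSorter(graph).static_order())
print(order)
```

TopologicalSorter raises graphlib.CycleError if the feeds describe a cycle, which is a convenient validation step for a pipeline definition.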

Page 17: July 2012 HUG: Building Data Pipelines on Hadoop

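Definition: Jobs & Parallelism

[Diagram: the same stage and feed graph (filter_serve(5m), filter_click(15m), click_serve_join(15m), click_by_demo(hourly); feeds bsrv_5m, gsrv_15m, bclk_15m, gclk_15m, aclk_15m, cdem_hourly), with each stage drawn as parallel units labeled W1, W2, ... up to the stage's Parallelism value.]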

<Stage name="filter_click">
  <Parallelism value="2"/>
  ...
</Stage>

<Stage name="click_serve_join">
  <Parallelism value="6"/>
  ...
</Stage>

<Stage name="filter_serve">
  <Parallelism value="4"/>
  ...
</Stage>

<Stage name="click_by_demo">
  <Parallelism value="3"/>
  ...
</Stage>
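
To get a feel for how frequency and parallelism combine, the sketch below counts jobs per hour for the four example stages. It assumes one job per parallel unit per stage instance, which the slide suggests but does not state; the names PERIOD_MINUTES and jobs_per_hour are hypothetical.

```python
# Period of each frequency value used on the slides, in minutes.
PERIOD_MINUTES = {"5m": 5, "15m": 15, "hourly": 60}

# (frequency, Parallelism value) per stage, taken from the slides.
STAGES = {
    "filter_serve": ("5m", 4),
    "filter_click": ("15m", 2),
    "click_serve_join": ("15m", 6),
    "click_by_demo": ("hourly", 3),
}


def jobs_per_hour(frequency, parallelism):
    """Stage instances per hour times the parallel jobs per instance."""
    instances = 60 // PERIOD_MINUTES[frequency]
    return instances * parallelism


per_stage = {name: jobs_per_hour(*spec) for name, spec in STAGES.items()}
total = sum(per_stage.values())
print(per_stage, total)
```

Even this four-stage example yields 83 jobs per hour under that assumption, which hints at why coordination becomes a central concern.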

Page 18: July 2012 HUG: Building Data Pipelines on Hadoop

Definition: Execution Plan

[Diagram: timeline from 05:05 to 06:15 in 5-minute steps. filter_serves(5m) stage instances run every 5 minutes and produce gsrv and bsrv feed instances; filter_clicks(15m) instances produce gclk and bclk; click_serve_join(15m) instances produce aclk; a click_by_demo(hourly) instance produces cdem. Legend: Stage Instances, Feed Instances, Jobs.]

Page 19: July 2012 HUG: Building Data Pipelines on Hadoop

Data Pipeline Components – how to put it together

Component | Definition
Data Collection | Ability to transport data from data event producers to a single repository
Data Acquisition | Ability to pull from a variety of external sources
Data Storage | System to store and access large volumes of data quickly
Data Processing | The ability to transform data in various useful ways including annotation, filtering and aggregation
Table Management / Meta Data | Provide a consistent API for data consumers with a standard meta data system
Job Coordination / Scheduling | Ability to schedule, submit, manage, retry, reprocess, catch up a DAG
Data Output | Enables push or pull based delivery of data subject to policies
Data Policy Management | Anonymize, retain, clean up and archive data
Monitoring / System Management | Provide the ability to operate, visualize and install pipelines
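
Of the coordination duties listed above, catch up is perhaps the simplest to sketch: after an outage, the scheduler needs the list of stage instances that should have run but did not. The function below is a minimal illustration under the assumption that instances are evenly spaced and identified by their nominal start time; pending_instances is a hypothetical name.

```python
from datetime import datetime, timedelta


def pending_instances(last_completed, now, period_minutes):
    """Nominal start times missed after last_completed, up to and including now."""
    period = timedelta(minutes=period_minutes)
    pending = []
    current = last_completed + period
    while current <= now:
        pending.append(current)
        current += period
    return pending


# A 15-minute stage last completed its 05:00 instance; it is now 05:47.
last_done = datetime(2012, 7, 18, 5, 0)
now = datetime(2012, 7, 18, 5, 47)
backlog = pending_instances(last_done, now, 15)
print([t.strftime("%H:%M") for t in backlog])
```

Reprocessing can reuse the same enumeration over a past window, with the difference that the listed instances have already run once.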

Page 20: July 2012 HUG: Building Data Pipelines on Hadoop

What is a Data Pipeline at Yahoo?

Page 21: July 2012 HUG: Building Data Pipelines on Hadoop

Sample Pipeline Flow

[Diagram. Boxes: Event Stream, Raw Data, Event Transformer, Inter Event Joins, Fraud Detection, Pre Aggregate, Analysis Optimization. Consumers: Reporting, Research, Targeting. Phase labels: Collection; Extract, Transform and Load; Business Logic Subflows. Additional labels: Verification, PreAggregate, Verification, Definitive Metrics.]

Page 22: July 2012 HUG: Building Data Pipelines on Hadoop

Sample DAG
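
Each node in this DAG is labeled with a stage name, optional input feeds (i:), optional output feeds (o:), and optional [latency: N] and [priority: N] attributes. The sketch below parses such a label once it has been written on a single line with spaces between the parts. That one-line layout, the regular expression, and the parse_label helper are assumptions made for illustration; labels that repeat the i: part are not handled.

```python
import re

# name, then optional "i: feeds (freq)", "o: feeds (freq)", latency, priority.
LABEL = re.compile(
    r"^(?P<name>\S+)"
    r"(?:\s+i:\s*(?P<inputs>.+?)\s*\((?P<in_freq>[^)]*)\))?"
    r"(?:\s+o:\s*(?P<outputs>.+?)\s*\((?P<out_freq>[^)]*)\))?"
    r"(?:\s*\[latency:\s*(?P<latency>\d+)\])?"
    r"(?:\s*\[priority:\s*(?P<priority>\d+)\])?\s*$"
)


def split_feeds(value):
    """Split a comma-separated feed list; an absent part gives an empty list."""
    return [feed.strip() for feed in value.split(",")] if value else []


def parse_label(label):
    """Turn one cleaned node label into a dictionary of its parts."""
    match = LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"unrecognized node label: {label!r}")
    fields = match.groupdict()
    return {
        "name": fields["name"],
        "inputs": split_feeds(fields["inputs"]),
        "outputs": split_feeds(fields["outputs"]),
        "latency": int(fields["latency"]) if fields["latency"] else None,
        "priority": int(fields["priority"]) if fields["priority"] else None,
    }


node = parse_label(
    "PREAGG_GD_QUERY i: post_tp_annotated_gd_click,post_tp_annotated_gd_impression (5m) "
    "o: gd_preagg (5m) [latency: 25] [priority: 100]"
)
print(node)
```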

KS_SCJi: post_tp_ks_click (5m)

o: annotated_ks_click (5m)[priority: 500]

KS_SCJ_CSi: annotated_ks_click (5m)o: annotated_ks_click (5m)

[latency: 35][priority: 500]

SCJ_KS_CLICK_INITi: post_tp_ks_click (5m)

[priority: 500]

SCJ_KS_SERVE_BDB_INITi: ks_serve_int,ks_serve_bdb_int (5m)

[priority: 500]

ER_BOOKING_CLICK_IMPR_KS_STATSi: er_booking_click_impr_ks (15m)o: er_booking_click_impr_ks (15m)

[priority: 400]

ER_BOOKING_CLICK_IMPR_KS_STATS_CSi: er_booking_click_impr_ks (15m)

[latency: 60][priority: 400]

ER_BOOKING_CLICK_IMPR_KS_QUERYi: pub_ep_report_ks (15m)

o: er_booking_click_impr_ks (15m)[latency: 60]

[priority: 400]

IR_PATH_PERF_NGD_INIT_HOURLYi: ngd_preagg (5m)

[priority: 200]

IR_PATH_PERF_NGD_QUERY_HOURLYi: ngd_preagg (5m)

o: ir_path_perf_ngd (hourly)[latency: 60]

[priority: 200]

NGD_PREAGG_QUERY_COMPLETEi: post_tp_ngd_serve,post_tp_ngd_click,ngd_conversion (5m)

o: ngd_preagg (5m)[latency: 25]

[priority: 100]

ER_CLICK_IMPR_NGD_INITi: ngd_preagg (5m)

[priority: 400]

SOX_METRICS_NGD_INITi: ngd_preagg (5m)

[priority: 500]

AM_NGD_INITi: ngd_preagg (5m)

[priority: 500]

OF_NGD_ORDER_HOURLY_INITi: ngd_preagg (5m)

[priority: 100]

ER_BOOKING_CLICK_IMPR_INITi: pub_ep_report (15m)

[priority: 400]

ER_BOOKING_CLICK_IMPR_QUERYi: pub_ep_report (15m)

o: er_booking_click_impr (15m)[latency: 60]

[priority: 400]

PUB_EP_REPORT_QUERYi: gd_preagg (5m)

o: pub_ep_report (15m)[latency: 60]

[priority: 400]

IR_PUB_PERF_INIT_HOURLYi: pub_ep_report (15m)

[priority: 400]

SCJi: gd_click (5m)

o: annotated_gd_click (5m)[priority: 500]

SCJ_CSi: annotated_gd_click (5m)o: annotated_gd_click (5m)

[latency: 35][priority: 500]

SCJ_GD_SERVE_BDB_INITi: gd_serve_int,gd_serve_bdb_int (5m)

[priority: 500]

SCJ_GD_CLICK_INITi: gd_click (5m)[priority: 500]

KS_SERVE_BDBi: ks_serve (5m)

o: ks_serve_bdb_int,ks_serve_int (5m)[priority: 500]

KS_SERVE_INT_CSi: ks_serve_int (5m)o: ks_serve_int (5m)

[latency: 30][priority: 500]

KS_SERVE_BDB_CSi: ks_serve_bdb_int (5m)o: ks_serve_bdb_int (5m)

[latency: 30][priority: 500]

KS_SERVE_BDB_INITi: ks_serve (5m)[priority: 500]

SIJi: gd_impr (5m)

o: annotated_gd_impression (5m)[priority: 500]

SIJ_CSi: annotated_gd_impression (5m)o: annotated_gd_impression (5m)

[latency: 35][priority: 500]

SIJ_GD_IMPR_INITi: gd_impr (5m)[priority: 500]

SIJ_GD_SERVE_BDB_INITi: gd_serve_int,gd_serve_bdb_int (5m)

[priority: 500]

NGD_SCJi: ngd_click (5m)

o: annotated_ngd_click (5m)[priority: 500]

NGD_SCJ_CSi: annotated_ngd_click (5m)o: annotated_ngd_click (5m)

[latency: 35][priority: 500]

NGD_SCJ_SERVE_BDB_INITi: ngd_serve_int,ngd_serve_bdb_int (5m)

[priority: 500]

NGD_SCJ_CLICK_INITi: ngd_click (5m)

[priority: 500]

NGD_COB[priority: 500]

NGD_SERVE_CSi: ngd_serve (5m)o: ngd_serve (5m)

[latency: 25][priority: 500]

NGD_CLICK_CSi: ngd_click (5m)o: ngd_click (5m)

[latency: 25][priority: 500]

NGD_CONV_CSi: ngd_conversion (5m)o: ngd_conversion (5m)

[latency: 25][priority: 500]

ER_CM_CLICK_IMPR_GD_STATSi: er_cm_click_impr_gd (15m)o: er_cm_click_impr_gd (15m)

[priority: 500]

ER_CM_CLICK_IMPR_GD_STATS_CSi: er_cm_click_impr_gd (15m)o: er_cm_click_impr_gd (15m)

[latency: 60][priority: 500]

ER_CM_CLICK_IMPR_GD_QUERYi: cm_gd_preagg (15m)

o: er_cm_click_impr_gd (15m)[latency: 60]

[priority: 500]

NGD_SERVE_FILTERED_QUERYi: ngd_serve (5m)

o: ngd_serve_filtered (5m)[latency: 120][priority: 500]

NGD_SERVE_FILTERED_INITi: ngd_serve (5m)

[priority: 500]

IMS_QUOTA_SERVER_STATSi: ims_quota_server (15m)o: ims_quota_server (15m)

[priority: 100]

IMS_QUOTA_SERVER_QUERY_STATS_CSi: ims_quota_server (15m)o: ims_quota_server (15m)

[latency: 500][priority: 100]

IMS_QUOTA_SERVER_QUERYi: post_tp_annotated_gd_click,post_tp_annotated_gd_impression (5m,5m)

o: ims_quota_server (15m)[latency: 500][priority: 100]

SMJi: creative_metric (5m)

o: annotated_gd_cm (5m)[priority: 500]

SMJ_CSi: annotated_gd_cm (5m)o: annotated_gd_cm (5m)

[latency: 35][priority: 500]

SMJ_CREATIVE_METRIC_INITi: creative_metric (5m)

[priority: 500]

SMJ_GD_SERVE_BDB_INITi: gd_serve_int,gd_serve_bdb_int (5m)

[priority: 500]

GD_SERVE_BDBi: gd_serve (5m)

o: gd_serve_bdb_int,gd_serve_int (5m)[priority: 500]

GD_SERVE_INT_CSi: gd_serve_int (5m)o: gd_serve_int (5m)

[latency: 30][priority: 500]

GD_SERVE_BDB_CSi: gd_serve_bdb_int (5m)o: gd_serve_bdb_int (5m)

[latency: 30][priority: 500]

GD_SERVE_BDB_INITi: gd_serve (5m)[priority: 500]

ANNOTATED_KS_CLICK_HOURLY_STATS_CSi: annotated_ks_click (hourly)o: annotated_ks_click (hourly)

[latency: 90][priority: 500]

ANNOTATED_KS_CLICK_HOURLY_STATSi: annotated_ks_click (hourly)o: annotated_ks_click (hourly)

[priority: 500]

ER_LINE_CLICK_IMPR_NGD_QUERYi: adv_ep_report_ngd (15m)

o: er_line_click_impr_ngd (15m)[latency: 60]

[priority: 400]

ER_LINE_CLICK_IMPR_MERGE_INITi: er_line_click_impr,er_line_click_impr_ngd (15m)

[priority: 200]

ER_LINE_CLICK_IMPR_NGD_INITi: adv_ep_report_ngd (15m)

[priority: 400]

LOF_FETCHER_GD_5M[latency: 20]

[priority: 500]

3

3PI_BID_PROC_F_STATSi: 3pi_bid_proc_fail (15m)o: 3pi_bid_proc_fail (15m)

[priority: 100]

3PI_BID_PROC_F_QUERYi: ngd_serve_3pi (5m)

o: 3pi_bid_proc_fail (15m)[latency: 60]

[priority: 100]

TPLLODS_CHECKS_POST_TP_NGD_SERVEi: annotated_gd_impression (5m)

o: post_tp_annotated_gd_impression (5m)[latency: 40]

[priority: 500]

NGD_SERVE_BDB_INITi: post_tp_ngd_serve (5m)

[priority: 500]

NGD_PREAGG_INITi: post_tp_ngd_serve,post_tp_ngd_click,ngd_conversion (5m)

[priority: 100]

NGD_PREDICT_PREAGG_INITi: post_tp_ngd_serve,post_tp_ngd_click,ngd_conversion (5m)

[priority: 100]

SQM_NGD_SERVEURL_IMPR_HOURLY_INITi: post_tp_ngd_serve (5m)

[priority: 100]

TP_NGD_SERVE_INITi: ngd_serve (5m)

[priority: 500]

ER_NETWORK_CLICK_IMPR_INITi: network_report,network_report_smp (15m)

[priority: 400]

ER_NETWORK_CLICK_IMPR_QUERYi: network_report,network_report_smp (15m)

o: er_network_click_impr (15m)[latency: 60]

[priority: 400]

NETWORK_REPORT_QUERYi: gd_preagg (5m)

o: network_report (15m)[latency: 60]

[priority: 400]

IR_ADV_NET_PUB_INITi: network_report (15m)

[priority: 400]

NETWORK_REPORT_SMP_QUERYi: gd_preagg (5m)

o: network_report_smp (15m)[latency: 60]

[priority: 400]

ER_NETWORK_CLICK_IMPR_MERGE_STATSi: er_network_click_impr_merged (15m)o: er_network_click_impr_merged (15m)

[priority: 200]

ER_NETWORK_CLICK_IMPR_MERGE_STATS_CSi: er_network_click_impr_merged (15m)o: er_network_click_impr_merged (15m)

[latency: 60][priority: 200]

ER_NETWORK_CLICK_IMPR_MERGE_QUERYi: er_network_click_impr,er_network_click_impr_ngd (15m)

o: er_network_click_impr_merged (15m)[priority: 200]

ER_NETWORK_CLICK_IMPR_MERGE_AMDi: gd_impr,gd_click (5m)

o: er_network_click_impr_merged (15m)[latency: 60]

[priority: 200]

GD_SERVE_CSi: gd_serve (5m)o: gd_serve (5m)

[latency: 25][priority: 500]

DH_DATA_VALIDATION_LOF_FETCHER_5M[priority: 500]

GD_SERVE_ROLLUP_INIT[priority: 500]

ACT_EXCH_RB_SEG_INITi: gd_serve,seg_beacon,ngd_serve (5m)

[priority: 500]

ACT_SRV_TGTSRV_HR_INITi: gd_serve (5m)[priority: 500]

GD_COB[priority: 500]

YOO_GD_SERVE_CSi: yoo_gd_serve (5m)o: yoo_gd_serve (5m)

[latency: 25][priority: 500]

ACT_CLICKS_TGTCLICKS_HOURLY_QUERYi: annotated_gd_click (5m)

o: act_apex_clicks,act_apex_targeted_clicks (hourly)[latency: 60]

[priority: 500]

ACT_CLICKS_TGTCLICKS_HOURLY_INITi: annotated_gd_click (5m)

[priority: 500]

NGD_SERVE_BDBi: post_tp_ngd_serve (5m)

o: ngd_serve_bdb_int,ngd_serve_int (5m)[priority: 500]

NGD_SERVE_INT_CSi: ngd_serve_int (5m)o: ngd_serve_int (5m)

[latency: 30][priority: 500]

NGD_SERVE_BDB_CSi: ngd_serve_bdb_int (5m)o: ngd_serve_bdb_int (5m)

[latency: 30][priority: 500]

YOO_GD_CLICK_CSi: yoo_gd_click (5m)o: yoo_gd_click (5m)

[latency: 25][priority: 500]

YOO_GD_CLICK_SORTED_INITi: yoo_gd_click (5m)

[priority: 100]

BATCH_COB[priority: 500]

GD_IMPR_CSi: gd_impr (5m)o: gd_impr (5m)

[latency: 25][priority: 500]

GD_CLICK_CSi: gd_click (5m)o: gd_click (5m)

[latency: 25][priority: 500]

3PI_BID_PROC_F_QUERY_STATS_CSi: 3pi_bid_proc_fail (15m)o: 3pi_bid_proc_fail (15m)

[latency: 60][priority: 100]

AM_NGD_STATSi: am_ngd (15m)o: am_ngd (15m)

[priority: 500]

AM_NGD_QUERY_STATS_CSi: am_ngd (15m)o: am_ngd (15m)

[latency: 120][priority: 500]

AM_NGD_QUERY[priority: 500]

SOX_AM_NGD_INITi: am_ngd (15m)

[priority: 500]

CM_PREAGG_INITi: annotated_gd_cm (5m)

[priority: 500]

SOX_AM_GD_DEF_METRICS_CHECKi: sox_metrics_impr (5m)

[priority: 500]

SOX_METRICS_FOR_AMi: sox_metrics_impr (5m)

[latency: 20][priority: 500]

SOX_METRICS_GD_IMPRi: post_tp_annotated_gd_impression (5m)

o: sox_metrics_impr (5m)[latency: 20]

[priority: 500]

SOX_METRICS_HOURLY_ROLLUP_INITi: sox_metrics_impr (5m)

[priority: 500]

GD_SERVE_ROLLUP_STATS[priority: 500]

GD_SERVE_ROLLUP_STATS_CS[priority: 500]

GD_SERVE_ROLLUP_QUERY[priority: 500]

DEFINITIVE_METRICS_GD_QS_CHECK_15M[priority: 500]

DEFINITIVE_METRICS_VALIDATE_QS_WORKER_15M[priority: 500]

MME_QS_QUERYi: gd_preagg (5m)

o: gd_quota_server (15m)[priority: 500]

MME_QS_STATSi: gd_quota_server (15m)o: gd_quota_server (15m)

[priority: 500]

MME_QS_AMDi: gd_impr,gd_click (5m)o: gd_quota_server (15m)

[latency: 500][priority: 500]

ANNOTATED_KS_CLICK_HOURLY_INITi: annotated_ks_click (5m)

[priority: 500]

ANNOTATED_KS_CLICK_HOURLY_QUERYi: annotated_ks_click (5m)

o: annotated_ks_click (hourly)[latency: 90]

[priority: 500]

POST_MAPPING_ANNOTATED_KS_CLICK_INITi: annotated_ks_click (5m)

[priority: 300]

KS_PREAGG_INITi: ks_serve (5m)

i: annotated_ks_click (5m)[priority: 300]

IR_ADV_NET_PUB_MERGE_QUERY_HOURLY_STATSi: ir_adv_net_pub_merged (hourly)o: ir_adv_net_pub_merged (hourly)

[priority: 200]

IR_ADV_NET_PUB_MERGE_QUERY_HOURLY_STATS_CSi: ir_adv_net_pub_merged (hourly)o: ir_adv_net_pub_merged (hourly)

[latency: 60][priority: 200]

IR_ADV_NET_PUB_MERGE_QUERY_HOURLYi: ir_adv_net_pub,ir_adv_net_pub_ngd,ir_adv_net_pub_ks (hourly)

o: ir_adv_net_pub_merged (hourly)[priority: 200]

IR_ADV_NET_PUB_MERGE_AMDi: gd_impr,gd_click (5m)

o: ir_adv_net_pub_merged (hourly)[latency: 60]

[priority: 200]

KS_SERVE_ROLLUP_STATS[priority: 500]

KS_SERVE_ROLLUP_STATS_CS[priority: 500]

KS_SERVE_ROLLUP_QUERY[priority: 500]

TP_GD_SERVE_CLICK_INITi: annotated_gd_click (5m)

[priority: 500]

TPLLODS_GD_CLICKi: annotated_gd_click (5m)

o: post_tp_annotated_gd_click (5m)[latency: 40]

[priority: 500]

TP_NGD_CLICK_INITi: annotated_ngd_click (5m)

[priority: 500]

TPLLODS_CHECKS_POST_TP_NGD_CLICKi: annotated_ngd_click (5m)o: post_tp_ngd_click (5m)

[latency: 40][priority: 500]

ER_LINE_CLICK_IMPR_INITi: adv_ep_report (5m,15m)

[priority: 400]

ER_LINE_CLICK_IMPR_QUERYi: adv_ep_report (15m)

o: er_line_click_impr (15m)[latency: 60]

[priority: 400]

ADV_EP_REPORT_QUERYi: gd_preagg (5m)

o: adv_ep_report (15m)[latency: 60]

[priority: 400]

IR_ADV_PERF_INIT_HOURLYi: adv_ep_report (15m)

[priority: 400]

APEX_AUDIT_LOG_STATSi: apex_audit_log (5m)o: apex_audit_log (5m)

[priority: 300]

APEX_AUDIT_LOG_QUERY_STATS_CSi: apex_audit_log (5m)o: apex_audit_log (5m)

[latency: 25][priority: 300]

AUDIT_LOG_CSi: apex_audit_log (5m)o: apex_audit_log (5m)

[latency: 25][priority: 300]

3PI_BID_PROC_F_INITi: ngd_serve_3pi (5m)

[priority: 100]

NGD_SERVE_3PI_QUERYi: ngd_serve (5m)

o: ngd_serve_3pi (5m)[latency: 30]

[priority: 100]

AM_KS_STATSi: am_ks (15m)o: am_ks (15m)[priority: 500]

AM_KS_QUERY_STATS_CSi: am_ks (15m)o: am_ks (15m)[latency: 120][priority: 500]

AM_KS_QUERYi: ks_preagg (5m)o: am_ks (15m)[latency: 120][priority: 500]

SOX_AM_KS_INITi: am_ks (15m)[priority: 500]

AM_GD_STATSi: am_gd (15m)o: am_gd (15m)[priority: 500]

AM_GD_QUERY_STATS_CSi: am_gd (15m)o: am_gd (15m)[latency: 120][priority: 500]

SOX_VALIDATE_AM_GD[priority: 500]

LOF_FETCHER_DEFAULT_5M[latency: 20]

[priority: 500]

NGD_SERVE_3PI_INITi: ngd_serve (5m)

[priority: 100]

TPLLODS_CHECKS_POST_TP_KS_CLICKi: ks_click (5m)

o: post_tp_ks_click (5m)[latency: 40]

[priority: 500]

POST_MAPPING_KS_CLICK_INITi: post_tp_ks_click (5m)

[priority: 300]

SOX_METRICS_KS_INITi: ks_preagg (5m)

[priority: 500]

PREDICT_CORE_STATSi: ngd_predict_core (30m)o: ngd_predict_core (30m)

[priority: 500]

PREDICT_CORE_QUERY_STATS_CSi: ngd_predict_core (30m)o: ngd_predict_core (30m)

[latency: 120][priority: 500]

PREDICT_CORE_QUERYi: ngd_predict_preagg (5m)o: ngd_predict_core (30m)

[latency: 120][priority: 500]

ER_CREATIVE_CLICK_IMPR_MERGE_STATSi: er_creative_click_impr_merged (15m)o: er_creative_click_impr_merged (15m)

[priority: 200]

ER_CREATIVE_CLICK_IMPR_MERGE_STATS_CSi: er_creative_click_impr_merged (15m)o: er_creative_click_impr_merged (15m)

[latency: 60][priority: 200]

ER_CREATIVE_CLICK_IMPR_MERGE_QUERYi: er_creative_click_impr,er_creative_click_impr_ngd (15m)

o: er_creative_click_impr_merged (15m)[priority: 200]

ER_CREATIVE_CLICK_IMPR_MERGE_AMDi: gd_impr,gd_click (5m)

o: er_creative_click_impr_merged (15m)[latency: 60]

[priority: 200]

TPLLODS_CHECKS_POST_TP_ANNOTATED_GD_IMPRESSIONi: annotated_gd_impression (5m)

o: post_tp_annotated_gd_impression (5m)[latency: 40]

[priority: 500]

SOX_METRICS_GD_IMPR_INITi: post_tp_annotated_gd_impression (5m)

[priority: 500]

POST_TP_DEFINITIVE_METRICS_INIT_5M[priority: 500]

PREAGG_GD_INITi: post_tp_annotated_gd_click,post_tp_annotated_gd_impression (5m)

[priority: 100]

SQM_GD_SERVEURL_IMPR_HOURLY_INITi: post_tp_annotated_gd_impression (5m)

[priority: 100]

IMS_MOROCCO_INITi: post_tp_annotated_gd_click,post_tp_annotated_gd_impression (5m,5m)

[priority: 500]

IMS_QUOTA_SERVER_INITi: post_tp_annotated_gd_click,post_tp_annotated_gd_impression (5m,5m)

[priority: 100]

IMS_INITi: post_tp_annotated_gd_impression (5m)

[priority: 100]

TP_GD_SERVE_IMPR_INITi: annotated_gd_impression (5m)

[priority: 500]

IMS_YOO_INITi: yoo_gd_serve_sorted (hourly)

[priority: 100]

IMS_YOOi: yoo_gd_serve_sorted (hourly)

o: ims_yoo (hourly)[latency: 60]

[priority: 100]

YOO_GD_SERVE_SORTEDi: post_tp_yoo_gd_serve (5m)

o: yoo_gd_serve_sorted (hourly)[latency: 60]

[priority: 100]

ACT_YOO_CLICKS_TGTCLICKS_HOURLY_INITi: yoo_gd_serve_sorted,yoo_gd_click_sorted (hourly)

[priority: 100]

ACT_YOO_SRV_TGTSRV_HR_INITi: yoo_gd_serve_sorted (hourly)

[priority: 100]

MME_QS_STATS_CSi: gd_quota_server (15m)o: gd_quota_server (15m)

[latency: 500][priority: 500]

SOX_AM_KS_METRICSi: am_ks (15m)

[latency: 20][priority: 500]

SOX_VALIDATE_AM_KS[priority: 500]

IR_ADV_PERF_QUERY_HOURLYi: adv_ep_report (15m)o: ir_adv_perf (hourly)

[latency: 60][priority: 400]

ER_BOOKING_CLICK_IMPR_MERGE_INITi: er_booking_click_impr,er_booking_click_impr_ngd (15m)

[priority: 200]

ER_BOOKING_CLICK_IMPR_MERGE_QUERYi: er_booking_click_impr,er_booking_click_impr_ngd (15m)

o: er_booking_click_impr_merged (15m)[priority: 200]

ER_BOOKING_CLICK_IMPR_NGD_QUERYi: pub_ep_report_ngd (15m)

o: er_booking_click_impr_ngd (15m)[latency: 60]

[priority: 400]

KS_CLICK_CSi: ks_click (5m)[priority: 300]

KS_CLICK_INITi: ks_click (5m)[priority: 300]

KS_SERVE_CSi: ks_serve (5m)[priority: 300]

KS_SERVE_ROLLUP_INIT[priority: 500]

POST_MAPPING_KS_SERVE_INITi: ks_serve (5m)[priority: 300]

TPLLODS_YOO_GD_SERVE_INITi: yoo_gd_serve (5m)

[priority: 500]

SEG_BEACON_CSi: seg_beacon (5m)o: seg_beacon (5m)

[latency: 25][priority: 300]

AM_GD_INITi: gd_preagg (5m)

[priority: 500]

AM_GD_QUERY[priority: 500]

PREAGG_GD_QUERYi: post_tp_annotated_gd_click,post_tp_annotated_gd_impression (5m)

o: gd_preagg (5m)[latency: 25]

[priority: 100]

CM_GD_PREAGG_INITi: gd_preagg,cm_preagg (5m)

[priority: 500]

OF_GD_ORDER_HOURLY_INITi: gd_preagg (5m)

[priority: 100]

ER_CLICK_IMPR_INITi: gd_preagg (5m)

[priority: 400]

IR_PATH_PERF_INIT_HOURLYi: gd_preagg (5m)

[priority: 400]

MME_QS_INITi: gd_preagg (5m)

[priority: 500]

ER_LINE_CLICK_IMPR_MERGE_STATSi: er_line_click_impr_merged (15m)o: er_line_click_impr_merged (15m)

[priority: 200]

ER_LINE_CLICK_IMPR_MERGE_STATS_CSi: er_line_click_impr_merged (15m)o: er_line_click_impr_merged (15m)

[latency: 60][priority: 200]

ER_LINE_CLICK_IMPR_MERGE_QUERYi: er_line_click_impr,er_line_click_impr_ngd (15m)

o: er_line_click_impr_merged (15m)[priority: 200]

ER_LINE_CLICK_IMPR_MERGE_AMDi: gd_impr,gd_click (5m)

o: er_line_click_impr_merged (15m)[latency: 60]

[priority: 200]

CM_GD_PREAGG_QUERYi: gd_preagg,cm_preagg (5m)

o: cm_gd_preagg (15m)[latency: 60]

[priority: 500]

ER_CM_CLICK_IMPR_GD_INITi: cm_gd_preagg (5m,15m)

[priority: 500]

POST_MAPPING_ANNOTATED_KS_CLICK_QUERYi: annotated_ks_click (5m)

o: post_mapping_annotated_ks_click (5m)[priority: 300]

OF_GD_ORDER_HOURLY_QUERYi: gd_preagg (5m)

o: of_gd_order (hourly)[latency: 60]

[priority: 100]

QBP_REVENUE_HOURLY_STATSi: qbp_revenue (hourly)o: qbp_revenue (hourly)

[priority: 500]

QBP_REVENUE_HOURLY_STATS_CSi: qbp_revenue (hourly)o: qbp_revenue (hourly)

[latency: 60][priority: 500]

QBP_REVENUE_HOURLY_QUERYi: annotated_ks_click (hourly)

[latency: 60][priority: 500]

ER_BOOKING_CLICK_IMPR_KS_INITi: pub_ep_report_ks (15m)

[priority: 400]

TPLLODS_YOO_GD_SERVEi: yoo_gd_serve (5m)

o: post_tp_yoo_gd_serve (5m)[latency: 40]

[priority: 500]

IR_ADV_PERF_MERGE_QUERY_HOURLY_STATSi: ir_adv_perf_merged (hourly)o: ir_adv_perf_merged (hourly)

[priority: 200]

IR_ADV_PERF_MERGE_QUERY_HOURLY_STATS_CSi: ir_adv_perf_merged (hourly)o: ir_adv_perf_merged (hourly)

[latency: 60][priority: 200]

IR_ADV_PERF_MERGE_QUERY_HOURLYi: ir_adv_perf,ir_adv_perf_ngd (hourly)

o: ir_adv_perf_merged (hourly)[priority: 200]

IR_ADV_PERF_MERGE_AMDi: gd_impr,gd_click (5m)

o: ir_adv_perf_merged (hourly)[latency: 60]

[priority: 200]

DQM_IR_FEED_DATA_CHECKi: ir_adv_perf_merged (hourly)

[priority: 500]

DEFAULT_COB[priority: 500]

CREATIVE_METRIC_CSi: creative_metric (5m)o: creative_metric (5m)

[latency: 25][priority: 300]

3PI_BID_PROC_BASIC_INITi: ngd_serve_3pi (5m)

[priority: 100]

DQM_HIGH_RISK_CREATIVES_STATSi: dqm_high_risk_creatives (hourly)o: dqm_high_risk_creatives (hourly)

[priority: 500]

DQM_HIGH_RISK_CREATIVES_STATS_CSi: dqm_high_risk_creatives (hourly)o: dqm_high_risk_creatives (hourly)

[latency: 60][priority: 500]

DQM_HIGH_RISK_CREATIVES_QUERYi: dqm_crtv_metrics_rolling (hourly)o: dqm_high_risk_creatives (hourly)

[latency: 60][priority: 500]

DQM_CRTV_REPORTED_QUERYi: dqm_high_risk_creatives (hourly)

o: dqm_crtv_reported (hourly)[latency: 60]

[priority: 500]

NETWORK_REPORT_SMP_NGD_QUERYi: ngd_preagg (5m)

o: network_report_smp_ngd (15m)[latency: 60]

[priority: 400]

ER_NETWORK_CLICK_IMPR_NGD_INITi: network_report_ngd,network_report_smp_ngd (15m)

[priority: 400]

NETWORK_REPORT_NGD_QUERYi: ngd_preagg (5m)

o: network_report_ngd (15m)[latency: 60]

[priority: 400]

ER_CREATIVE_CLICK_IMPR_NGD_QUERYi: ngd_preagg (5m)

o: er_creative_click_impr_ngd (15m)[latency: 60]

[priority: 400]

ADV_EP_REPORT_NGD_QUERYi: ngd_preagg (5m)

o: adv_ep_report_ngd (15m)[latency: 60]

[priority: 400]

PUB_EP_REPORT_NGD_QUERYi: ngd_preagg (5m)

o: pub_ep_report_ngd (15m)[latency: 60]

[priority: 400]

KS_PREAGG_QUERYi: ks_serve (5m)

i: annotated_ks_click (5m)o: ks_preagg (5m)

[latency: 25][priority: 300]

ER_NETWORK_CLICK_IMPR_NGD_QUERYi: network_report_ngd,network_report_smp_ngd (15m)

o: er_network_click_impr_ngd (15m)[latency: 60]

[priority: 400]

IR_ADV_NET_PUB_NGD_INITi: network_report_ngd (15m)

[priority: 200]

SQM_SITE_METRICS_HOURLY_INITi: er_booking_click_impr_merged (15m)

[priority: 100]

SQM_SITE_METRICS_HOURLY_QUERYi: er_booking_click_impr_merged (15m)

o: sqm_site_metrics (hourly)[latency: 60]

[priority: 100]

ER_BOOKING_CLICK_IMPR_MERGE_STATSi: er_booking_click_impr_merged (15m)o: er_booking_click_impr_merged (15m)

[priority: 200]

ER_BOOKING_CLICK_IMPR_MERGE_AMDi: gd_impr,gd_click (5m)

o: er_booking_click_impr_merged (15m)[latency: 60]

[priority: 200]

TPLLODS_KS_CLICK_INITi: ks_click (5m)[priority: 500]

ER_BOOKING_CLICK_IMPR_MERGE_STATS_CSi: er_booking_click_impr_merged (15m)o: er_booking_click_impr_merged (15m)

[latency: 60][priority: 200]

POST_MAPPING_KS_CLICK_QUERYi: post_tp_ks_click (5m)

o: post_mapping_ks_click (5m)[priority: 300]

IMS_MOROCCO_STATSi: ims_morocco (hourly)o: ims_morocco (hourly)

[priority: 500]

IMS_MOROCCO_QUERY_STATS_CSi: ims_morocco (hourly)o: ims_morocco (hourly)

[latency: 60][priority: 500]

IMS_MOROCCO_QUERYi: post_tp_annotated_gd_click,post_tp_annotated_gd_impression (5m,5m)

o: ims_morocco (hourly)[latency: 60]

[priority: 500]

3PI_BID_PROC_BASIC_STATSi: 3pi_bid_proc_basic (15m)o: 3pi_bid_proc_basic (15m)

[priority: 100]

3PI_BID_PROC_BASIC_QUERYi: ngd_serve_3pi (5m)

o: 3pi_bid_proc_basic (15m)[latency: 60]

[priority: 100]

DH_DEFINITIVE_METRICS_CS_5Mi: dh_definitive_metrics (5m)o: dh_definitive_metrics (5m)

[latency: 25][priority: 500]

ER_REPORTS_KS_INITi: ks_preagg (5m)

[priority: 400]

AM_KS_INITi: ks_preagg (5m)

[priority: 500]

IR_ADV_PERF_KS_INIT_HOURLYi: ks_preagg (5m)

[priority: 200]

IR_PATH_PERF_KS_INIT_HOURLYi: ks_preagg (5m)

[priority: 200]

KS_PREAGG_HOURLY_INITi: ks_preagg (5m)

[latency: 90][priority: 300]

POST_MAPPING_KS_SERVE_QUERYi: ks_serve (5m)

o: post_mapping_ks_serve (5m)[priority: 300]

ACT_EXCH_RB_SEG_STATSi: act_exchange_rb_segments (hourly)o: act_exchange_rb_segments (hourly)

[priority: 500]

ACT_EXCH_RB_SEG_STATS_CSi: act_exchange_rb_segments (hourly)o: act_exchange_rb_segments (hourly)

[latency: 60][priority: 500]

ACT_EXCH_RB_SEGi: act_exchange_rb_segments_int (hourly)

o: act_exchange_rb_segments (hourly)[latency: 60]

[priority: 500]

SOX_METRICS_NGD_IMPR_QUERYi: ngd_preagg (5m)

o: sox_metrics_ngd_impr (5m)[latency: 20]

[priority: 500]

SOX_METRICS_NGD_CLICK_QUERYi: ngd_preagg (5m)

o: sox_metrics_ngd_click (5m)[latency: 20]

[priority: 500]

SOX_METRICS_NGD_CONV_QUERYi: ngd_preagg (5m)

o: sox_metrics_ngd_conv (5m)[latency: 20]

[priority: 500]

ER_CREATIVE_CLICK_IMPR_QUERYi: gd_preagg (5m)

o: er_creative_click_impr (15m)[latency: 60]

[priority: 400]

SOX_AM_NGD_DEF_METRICS_CHECKi: sox_metrics_ngd_impr,sox_metrics_ngd_click,sox_metrics_ngd_conv (5m)

[priority: 500]

SOX_METRICS_NGD_HOURLY_ROLLUP_INITi: sox_metrics_ngd_impr,sox_metrics_ngd_click,sox_metrics_ngd_conv (5m)

[priority: 500]

IR_PUB_PERF_MERGE_QUERY_HOURLY_STATSi: ir_pub_perf_merged (hourly)o: ir_pub_perf_merged (hourly)

[priority: 200]

IR_PUB_PERF_MERGE_QUERY_HOURLY_STATS_CSi: ir_pub_perf_merged (hourly)o: ir_pub_perf_merged (hourly)

[latency: 60][priority: 200]

IR_PUB_PERF_MERGE_QUERY_HOURLYi: ir_pub_perf,ir_pub_perf_ngd,ir_pub_perf_ks (hourly)

o: ir_pub_perf_merged (hourly)[priority: 200]

IR_PUB_PERF_MERGE_AMDi: gd_impr,gd_click (5m)

o: ir_pub_perf_merged (hourly)[latency: 60]

[priority: 200]

NGD_PREDICT_PREAGG_QUERY_COMPLETEi: post_tp_ngd_serve,post_tp_ngd_click,ngd_conversion (5m)

o: ngd_predict_preagg (5m)[latency: 60]

[priority: 100]

PREDICT_PEARL1_HOURLY_INITi: ngd_predict_preagg (5m)

[priority: 100]

PREDICT_PEARL2_HOURLY_INITi: ngd_predict_preagg (5m)

[priority: 100]

PREDICT_CORE_INITi: ngd_predict_preagg (5m)

[priority: 500]

NGD_RECONCILER_HOURLY_INITi: ngd_predict_preagg (5m)

[priority: 100]

PREDICT_DAILYVOL_HOURLY_INIT  i: ngd_predict_preagg (5m)  [priority: 100]
NGD_RECONCILER_LZ2_HOURLY_INIT  i: ngd_predict_preagg (5m)  [priority: 100]
DEFINITIVE_METRICS_ER_LINE_CLICK_IMPR_CHECK_15M  [priority: 500]
ER_NETWORK_CLICK_IMPR_MERGE_INIT  i: er_network_click_impr, er_network_click_impr_ngd (15m)  [priority: 200]
AM_GD_AMD  i: gd_impr (5m)  o: am_gd (15m)  [latency: 120] [priority: 500]
SOX_AM_GD_INIT  i: am_gd (15m)  [priority: 500]
TERMINAL  [priority: 500]
QBP_REVENUE_HOURLY_INIT  i: annotated_ks_click (hourly)  [priority: 500]
KS_CLICK_BIDDED_HOURLY_INIT  i: annotated_ks_click, cm_click_bidded_terms (hourly)  [priority: 500]
PREDICT_PEARL1_HOURLY_QUERY  i: ngd_predict_preagg (5m)  o: ngd_predict_pearl1 (hourly)  [latency: 60] [priority: 100]
ACT_YOO_CLICKS_TGTCLICKS_HOURLY  i: yoo_gd_serve_sorted, yoo_gd_click_sorted (hourly)  o: act_yoo_clicks, act_yoo_targeted_clicks (hourly)  [latency: 60] [priority: 100]
YOO_GD_CLICK_SORTED  i: yoo_gd_click (5m)  o: yoo_gd_click_sorted (hourly)  [latency: 60] [priority: 100]
PUB_EP_REPORT_KS_QUERY  i: ks_preagg (5m)  o: pub_ep_report_ks (15m)  [latency: 60] [priority: 400]
NETWORK_REPORT_KS_QUERY  i: ks_preagg (5m)  o: network_report_ks (15m)  [latency: 60] [priority: 400]
SOX_OF_GD_HOURLY_METRICS  i: of_gd_order (hourly)  [latency: 20] [priority: 500]
SOX_VALIDATE_OF_GD_HOURLY  [priority: 500]
SOX_OF_GD_HOURLY_INIT  i: of_gd_order (hourly)  [priority: 500]
ER_CREATIVE_CLICK_IMPR_MERGE_INIT  i: er_creative_click_impr, er_creative_click_impr_ngd (15m)  [priority: 200]
DQM_REPORTED_DATA_CHECK  [priority: 500]
PREDICT_PEARL2_HOURLY_QUERY  i: ngd_predict_preagg (5m)  o: ngd_predict_pearl2 (hourly)  [latency: 60] [priority: 100]
LOF_FETCHER_BATCH_5M  [latency: 20] [priority: 500]
3PI_BID_PROC_BASIC_QUERY_STATS_CS  i: 3pi_bid_proc_basic (15m)  o: 3pi_bid_proc_basic (15m)  [latency: 60] [priority: 100]
LOF_FETCHER_NGD_5M  [latency: 20] [priority: 500]
DQM_ROLLING_METRICS_DATA_CHECK  [priority: 500]
DQM_ROLLING_AGGREGATION_QUERY  i: ir_adv_perf_merged (hourly)  o: dqm_crtv_metrics_rolling (hourly)  [latency: 60] [priority: 500]
IR_PATH_PERF_QUERY_HOURLY  i: gd_preagg (5m)  o: ir_path_perf (hourly)  [latency: 60] [priority: 400]
POST_TP_DEFINITIVE_METRICS_WORKER_5M  [priority: 500]
POST_TP_DEFINITIVE_METRICS_CHECK_15M  [priority: 500]
IR_PATH_PERF_MERGE_INIT  i: ir_path_perf, ir_path_perf_ngd, ir_path_perf_ks (hourly)  [priority: 200]

ACCOUNT_PARTITION_MAP
NGD_RECONCILER_HOURLY_QUERY  i: ngd_predict_preagg (5m)  o: ngd_reconciler (hourly)  [latency: 60] [priority: 100]
SOX_METRICS_FOR_AM_KS  i: sox_metrics_ks_click (5m)  [latency: 20] [priority: 500]
YOO_GD_SERVE_SORTED_INIT  i: post_tp_yoo_gd_serve (5m)  [priority: 100]
SOX_METRICS_KS_CLICK_QUERY  i: post_tp_ks_click (5m)  o: sox_metrics_ks_click (5m)  [latency: 20] [priority: 500]
SOX_AM_KS_DEF_METRICS_CHECK  i: sox_metrics_ks_click (5m)  [priority: 500]
SOX_METRICS_KS_HOURLY_ROLLUP_INIT  i: sox_metrics_ks_click (5m)  [priority: 500]
IR_ADV_PERF_NGD_INIT_HOURLY  i: adv_ep_report_ngd (15m)  [priority: 200]
SOX_VALIDATE_AM_NGD  [priority: 500]
SOX_AM_NGD_METRICS  i: am_ngd (15m)  [latency: 20] [priority: 500]
SOX_METRICS_FOR_AM_NGD  i: sox_metrics_ngd_impr, sox_metrics_ngd_click, sox_metrics_ngd_conv (5m)  [latency: 20] [priority: 500]
KS_CLICK_BIDDED_HOURLY_CFI  o: cm_click_bidded_terms (hourly)  [latency: 60] [priority: 500]
ACT_EXCH_RB_SEG_INT  i: gd_serve, seg_beacon, ngd_serve (5m)  o: act_exchange_rb_segments_int (hourly)  [latency: 60] [priority: 500]
DEFINITIVE_METRICS_VALIDATE_ER_WORKER_15M  [priority: 500]
SOX_VALIDATE_OF_NGD_HOURLY  [priority: 500]
SOX_OF_NGD_HOURLY_METRICS  i: of_ngd_order (hourly)  [latency: 20] [priority: 500]
SOX_METRICS_FOR_OF_NGD_HOURLY  i: sox_metrics_ngd (hourly)  [latency: 20] [priority: 500]
SOX_AM_GD_METRICS  i: am_gd (15m)  [latency: 20] [priority: 500]
IR_PATH_PERF_MERGE_QUERY_HOURLY_STATS  i: ir_path_perf_merged (hourly)  o: ir_path_perf_merged (hourly)  [priority: 200]
IR_PATH_PERF_MERGE_QUERY_HOURLY_STATS_CS  i: ir_path_perf_merged (hourly)  o: ir_path_perf_merged (hourly)  [latency: 60] [priority: 200]
IR_PATH_PERF_MERGE_QUERY_HOURLY  i: ir_path_perf, ir_path_perf_ngd, ir_path_perf_ks (hourly)  o: ir_path_perf_merged (hourly)  [priority: 200]
IR_PATH_PERF_MERGE_AMD  i: gd_impr, gd_click (5m)  o: ir_path_perf_merged (hourly)  [latency: 60] [priority: 200]

CM_PREAGG_QUERY  i: annotated_gd_cm (5m)  o: cm_preagg (5m)  [latency: 45] [priority: 500]
SQM_GD_SERVEURL_IMPR_HOURLY_QUERY  i: post_tp_annotated_gd_impression (5m)  o: sqm_gd_serveurl_impr (hourly)  [latency: 60] [priority: 100]
IR_PUB_PERF_KS_INIT_HOURLY  i: pub_ep_report_ks (15m)  [priority: 200]
KS_OFFER_BIDDED_HOURLY_EXP_REPORTING_TAG_QUERY  i: ks_preagg (hourly)  o: ks_offer (hourly)  [latency: 60] [priority: 500]
KS_OFFER_BIDDED_HOURLY_QUERY  i: ks_offer, cm_serve_bidded_terms (hourly)  o: ks_offer_bidded (hourly)  [latency: 60] [priority: 500]
KS_OFFER_BIDDED_HOURLY_INIT  i: ks_preagg (hourly)  [priority: 500]
IR_ADV_PERF_KS_QUERY_HOURLY  i: ks_preagg (5m)  o: ir_adv_perf_ks (hourly)  [latency: 60] [priority: 200]
LATE_DATA_PROCESSOR_BATCH
IR_PATH_PERF_KS_QUERY_HOURLY  i: ks_preagg (5m)  o: ir_path_perf_ks (hourly)  [latency: 60] [priority: 200]
IR_PUB_PERF_NGD_INIT_HOURLY  i: pub_ep_report_ngd (15m)  [priority: 200]
ER_BOOKING_CLICK_IMPR_NGD_INIT  i: pub_ep_report_ngd (15m)  [priority: 400]
IR_ADV_NET_PUB_KS_INIT  i: network_report_ks (15m)  [priority: 200]
SQM_NGD_SERVEURL_IMPR_HOURLY_QUERY  i: post_tp_ngd_serve (5m)  o: sqm_ngd_serveurl_impr (hourly)  [latency: 60] [priority: 100]
IR_PUB_PERF_NGD_QUERY_HOURLY  i: pub_ep_report_ngd (15m)  o: ir_pub_perf_ngd (hourly)  [latency: 60] [priority: 200]
SOX_METRICS_KS_HOURLY_ROLLUP  i: sox_metrics_ks_click (5m)  o: sox_metrics_ks (hourly)  [latency: 60] [priority: 500]
IR_ADV_PERF_MERGE_INIT  i: ir_adv_perf, ir_adv_perf_ngd, ir_adv_perf_ks (hourly)  [priority: 200]
SOX_METRICS_FOR_OF_HOURLY  i: sox_metrics_impr (hourly)  [latency: 20] [priority: 500]
SOX_OF_GD_HOURLY_DEF_METRICS_CHECK  i: sox_metrics_impr (hourly)  [priority: 500]
DQM_ROLLING_DATA_CHECK  i: dqm_crtv_metrics_rolling (hourly)  [priority: 500]
IR_ADV_NET_PUB_QUERY  i: network_report (15m)  o: ir_adv_net_pub (hourly)  [latency: 60] [priority: 400]
IR_ADV_NET_PUB_MERGE_INIT  i: ir_adv_net_pub, ir_adv_net_pub_ngd, ir_adv_net_pub_ks (hourly)  [priority: 200]
IR_ADV_PERF_NGD_QUERY_HOURLY  i: adv_ep_report_ngd (15m)  o: ir_adv_perf_ngd (hourly)  [latency: 60] [priority: 200]
IR_PUB_PERF_MERGE_INIT  i: ir_pub_perf, ir_pub_perf_ngd, ir_pub_perf_ks (hourly)  [priority: 200]
IR_ADV_NET_PUB_NGD_QUERY  i: network_report_ngd (15m)  o: ir_adv_net_pub_ngd (hourly)  [latency: 60] [priority: 200]

SOX_OF_NGD_HOURLY_DEF_METRICS_CHECK  i: sox_metrics_ngd (hourly)  [priority: 500]
SOX_METRICS_NGD_HOURLY_ROLLUP  i: sox_metrics_ngd_impr, sox_metrics_ngd_click, sox_metrics_ngd_conv (5m)  o: sox_metrics_ngd (hourly)  [latency: 60] [priority: 500]
IR_PUB_PERF_QUERY_HOURLY  i: pub_ep_report (15m)  o: ir_pub_perf (hourly)  [latency: 60] [priority: 400]
IR_PUB_PERF_KS_QUERY_HOURLY  i: pub_ep_report_ks (15m)  o: ir_pub_perf_ks (hourly)  [latency: 60] [priority: 200]
IR_ADV_NET_PUB_KS_QUERY  i: network_report_ks (15m)  o: ir_adv_net_pub_ks (hourly)  [latency: 60] [priority: 200]
PREDICT_DAILYVOL_HOURLY_QUERY  i: ngd_predict_preagg (5m)  o: ngd_predict_dailyvol (hourly)  [latency: 60] [priority: 100]
OF_NGD_ORDER_HOURLY_QUERY  i: ngd_preagg (5m)  o: of_ngd_order (hourly)  [latency: 60] [priority: 100]
SOX_OF_NGD_HOURLY_INIT  i: of_ngd_order (hourly)  [priority: 500]
KS_PREAGG_HOURLY_QUERY  i: ks_preagg (5m)  o: ks_preagg (hourly)  [latency: 90] [priority: 300]
NGD_RECONCILER_LZ2_HOURLY_QUERY  i: ngd_predict_preagg (5m)  o: ngd_reconciler_lz2 (hourly)  [latency: 60] [priority: 100]
KS_CLICK_BIDDED_HOURLY_QUERY  i: ks_bidded_click, cm_click_bidded_terms (hourly)  o: ks_click_bidded (hourly)  [latency: 60] [priority: 500]
KS_CLICK_BIDDED_HOURLY_EXP_REPORTING_TAG_QUERY  i: annotated_ks_click (hourly)  o: ks_bidded_click (hourly)  [latency: 60] [priority: 500]
KS_OFFER_BIDDED_HOURLY_CFI  o: cm_serve_bidded_terms (hourly)  [latency: 60] [priority: 500]
SOX_METRICS_HOURLY_ROLLUP  i: sox_metrics_impr (5m)  o: sox_metrics_impr (hourly)  [latency: 60] [priority: 500]
IMS_QUERY  i: post_tp_annotated_gd_impression (5m)  o: gd_ims (hourly)  [latency: 60] [priority: 100]
ACT_SRV_TGTSRV  i: gd_serve (5m)  o: act_apex_serves, act_apex_targeted_serves (hourly)  [latency: 60] [priority: 500]
CURRENCY_LOOKUP
CURRENCY_LOOKUP_DATA_CHECK  i: currency_lookup (hourly)  [priority: 500]
ACT_YOO_SRV_TGTSRV  i: yoo_gd_serve_sorted (hourly)  o: act_yoo_serves, act_yoo_targeted_serves (hourly)  [latency: 60] [priority: 100]
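
The job labels above each name the datasets a job reads (i:) and writes (o:), which is enough to derive a dependency graph. The Python sketch below illustrates that idea on four of the jobs listed. The tuple layout, the function names, and the matching rule (a job depends on any other job that writes a dataset with the same name and frequency) are assumptions made for illustration, not how the Yahoo pipeline or Oozie actually resolves dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# (job, inputs, outputs); each dataset is a (name, frequency) pair copied
# from the labels above. The tuple layout itself is hypothetical.
JOBS = [
    ("KS_PREAGG_HOURLY_QUERY",
     [("ks_preagg", "5m")], [("ks_preagg", "hourly")]),
    ("KS_OFFER_BIDDED_HOURLY_EXP_REPORTING_TAG_QUERY",
     [("ks_preagg", "hourly")], [("ks_offer", "hourly")]),
    ("KS_OFFER_BIDDED_HOURLY_CFI",
     [], [("cm_serve_bidded_terms", "hourly")]),
    ("KS_OFFER_BIDDED_HOURLY_QUERY",
     [("ks_offer", "hourly"), ("cm_serve_bidded_terms", "hourly")],
     [("ks_offer_bidded", "hourly")]),
]


def build_dependencies(jobs):
    """Map each job to the set of jobs that produce one of its inputs."""
    producers = {}
    for name, _inputs, outputs in jobs:
        for dataset in outputs:
            producers.setdefault(dataset, set()).add(name)

    dependencies = {}
    for name, inputs, _outputs in jobs:
        dependencies[name] = set()
        for dataset in inputs:
            # Skip self-edges: some jobs read and write the same dataset.
            dependencies[name] |= producers.get(dataset, set()) - {name}
    return dependencies


def run_order(jobs):
    """Return one valid execution order, producers before consumers."""
    return list(TopologicalSorter(build_dependencies(jobs)).static_order())


if __name__ == "__main__":
    for job in run_order(JOBS):
        print(job)
```

Inputs with no producer in the list (here the 5m ks_preagg feed) are treated as external data, so the job that reads them has no upstream dependency.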

Page 23: July 2012 HUG: Building Data Pipelines on Hadoop

Challenges

•  Scale
•  Low Latency
•  Operational challenges
    – Zero downtime upgrades
    – Reprocessing
    – Late data processing
    – Catch up
    – Capacity Planning
•  Data Quality
•  Business Agility
    – Schema evolution
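
As a rough illustration of the "Catch up" item, the Python sketch below lists the periods a job still has to process after falling behind. The function name `pending_periods` and its parameters are hypothetical, and the sketch assumes fixed-length periods identified by their start time; it does not describe a specific scheduler.

```python
from datetime import datetime, timedelta


def pending_periods(last_done, now, step):
    """Return start times of complete periods that have not run yet.

    last_done is the start time of the most recent processed period.
    A period is only returned once it has fully elapsed (start + step <= now).
    """
    periods = []
    start = last_done + step
    while start + step <= now:
        periods.append(start)
        start += step
    return periods


if __name__ == "__main__":
    backlog = pending_periods(
        last_done=datetime(2012, 7, 18, 9, 0),
        now=datetime(2012, 7, 18, 12, 30),
        step=timedelta(hours=1),
    )
    for period in backlog:
        print(period.isoformat())
```

The same helper can describe reprocessing if `last_done` is deliberately moved back to just before the first period that needs to be rerun.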

Page 24: July 2012 HUG: Building Data Pipelines on Hadoop

Data Pipeline Components

Component | Definition | Product
Data Collection | Ability to transport data from data event producers to a single repository | Y! Data Highway
Data Acquisition | Ability to pull from a variety of external sources | GDM
Data Storage | System to store and access large volumes of data quickly | HDFS
Data Processing | The ability to transform data in various useful ways including annotation, filtering and aggregation | M/R, PIG, Hive
Table Management / Meta Data | Provide a consistent API for data consumers with a standard meta data system | HCatalog
Job Coordination/Scheduling | Ability to schedule, submit, manage, retry, reprocess, catch up a DAG | Oozie
Data Output | Enables push or pull based delivery of data subject to policies | HDFS Proxy
Data Policy Management | Anonymize, retain, clean up and archive data | GDM archive
Monitoring / System Management | Provide the ability to operate, visualize and install pipelines | Custom
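
The "retry" responsibility listed under Job Coordination/Scheduling can be sketched in a few lines of Python. This is a generic bounded-retry helper written for illustration only; `run_with_retries` and its parameters are hypothetical names, and it is not Oozie's API or configuration.

```python
import time


def run_with_retries(action, max_attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Call action() until it succeeds or max_attempts is reached.

    The sleep function is injectable so tests can skip real waiting.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt < max_attempts:
                sleep(delay_seconds)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_job():
        # Fails twice, then succeeds, to simulate a transient error.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise IOError("transient failure")
        return "done"

    print(run_with_retries(flaky_job, max_attempts=3))
```

A real coordinator would usually also cap the total retry time and alert once the attempts are exhausted, which ties back to the Monitoring row in the table.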

Page 25: July 2012 HUG: Building Data Pipelines on Hadoop

Questions?