How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala


Description

At MediaMath, we deal with billions of records every day. One of our biggest challenges is hourly reporting of attribution data: the joining of billions of records to millions of events. How did we solve this hourly attribution reporting problem? We will walk through our evaluation, testing, and fine-tuning of a variety of tools, including Netezza, Hive, and Pig, and how we ultimately chose Cloudera's Impala.

Transcript of How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

Page 1: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

How MediaMath Solved a Critical Reporting Problem with Impala

The Cloudera Sessions
June 18, 2014

Ram Narayanan, Senior Director of Database Architecture & Operations

©2014 MEDIAMATH INC.

Page 2: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

Digital Marketing Pioneer
•  Founded in 2007
•  Global technology company
•  Invented first Demand Side Platform (DSP) for online ads
•  Conducts online advertising through real-time bidding & programmatic buying

About MediaMath


Page 3: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

About MediaMath: Overview of Real-Time Bidding

[Diagram: a real-time auction completes in under 30 ms, connecting the Advertiser (Client), the User, and the publisher page (www.cnn.com) where the ad is shown]

Page 4: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

About MediaMath: Overview of Real-Time Bidding

[Diagram: the user who saw the ad on www.cnn.com later purchases on the advertiser's site (www.shoes.com); the sale and spend are recorded in Event Logs]

Page 5: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

•  Ad Opportunities: 80-100 billion per day
   → 1.2 million opportunities per second at peak
•  We bid on 30-40 billion ads per day
•  We serve 1-2 billion ads per day
•  15-20 million events (click, sale, online sign-up) per hour
•  2 TB of data daily (compressed)
   → Note: This only counts our wins. If we count losses, we easily reach PBs.

About MediaMath
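As a quick sanity check on these figures, the per-second rate can be derived from the daily volume (a sketch assuming the slide's upper-bound value of 100 billion opportunities per day):

```python
# Back-of-the-envelope check of the slide's throughput figures.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

opportunities_per_day = 100e9  # upper bound from the slide: 100 billion/day
avg_per_second = opportunities_per_day / SECONDS_PER_DAY

print(f"{avg_per_second:,.0f} opportunities/sec on average")
# roughly 1.16 million/sec, consistent with the quoted 1.2 million/sec peak
```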

Page 6: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

Which ad (impression) led to which action, like a sale or online signup?
•  35-40 billion recorded impressions served every 30 days
•  15-20 million events per hour
•  Need to join events with impressions 2x per hour

→ Find matching records
→ Perform complex sequencing & allocation logic
→ Run aggregations on results
→ Send data to data marts
→ Provide hourly reporting to clients

The Reporting Attribution Problem

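The matching and allocation steps above can be sketched in miniature. This is a hypothetical last-touch example with invented field names and toy data, not MediaMath's actual sequencing and allocation logic:

```python
from collections import defaultdict

# Toy inputs: (user_id, campaign_id, timestamp) and (user_id, event_type, timestamp).
impressions = [
    ("u1", "campA", 100), ("u1", "campB", 200), ("u2", "campA", 150),
]
events = [
    ("u1", "sale", 250), ("u2", "signup", 120),
]

# Index impressions by user so each event only scans that user's records.
by_user = defaultdict(list)
for user, campaign, ts in impressions:
    by_user[user].append((ts, campaign))

credit = defaultdict(int)
for user, event_type, ev_ts in events:
    # Find matching records: impressions shown to this user before the event.
    prior = [(ts, c) for ts, c in by_user[user] if ts < ev_ts]
    if prior:
        _, campaign = max(prior)             # last-touch: most recent impression wins
        credit[(campaign, event_type)] += 1  # aggregation step

print(dict(credit))  # {('campB', 'sale'): 1}
```

At MediaMath's scale this join runs across billions of impressions, which is why a single-node appliance struggled and a distributed engine was needed.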

Page 7: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

Incumbent Architecture: Appliance-based (Netezza)

- Cost: Expensive
- Scale: Non-incremental scalability
- Performance: Reporting lag
- Reporting inflexibility
- Product feature constrained

Page 8: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

To build a data warehouse architecture that could perform hourly reporting of attribution data at scale, while remaining affordable and easy to manage.

Our Goal

Page 9: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

" Scalability  Handle  10-­‐50x  scale  

" Capability    Ability  to  perform  big  data  joins  at  scale  

" Performance  Complete  aggregaNon  in  <60  minutes  

" Cost  effec1ve  Cheaper  than  appliance-­‐based  soluNons      


Evaluation Criteria:

Page 10: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

" Hive  Run  Nme:  Took  5-­‐6  hours  to  complete  Stability:  High    

" Pig  Run  Nme:  Took  4-­‐5  hours  to  complete  Stability:  High  

" Impala  Beta  (0.6)  Run  Nme:  Took  2-­‐3  hours  to  complete  Stability:  Low        

Evaluated  OpNons:  Round  1  

Page 11: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

" Hive:  Post-­‐Tuning  (map  joins,  bucke1ng,  split  size,  etc.)  Run  Nme:  Took  2-­‐3  hours  to  complete  Stability:  High    

" Impala  GPA  (1.0)  (L0  compression,  slicing,  tuning,  hw  upgrade)  Run  Nme:  Took  30  minutes  to  complete  Stability:  High        

Evaluated  OpNons:  Round  2  

Page 12: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

Data Warehouse Architecture 2011

[Diagram: Bid Logs, Pixel Logs, and Metadata flow through ELT into Netezza; Netezza performs Attribution and Aggregation, loading the Reporting Data Marts that serve Reports]

Page 13: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

Data Warehouse Architecture 2011 vs. 2013

[Diagram: the 2011 Netezza pipeline shown alongside the 2013 pipeline, in which Attribution has moved into Hadoop; Bid Logs, Pixel Logs, and Metadata flow through ELT, Hadoop performs Attribution, and Netezza performs Aggregation to load the Reporting Data Marts that serve Reports]

Page 14: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

•  December 2013: Peak season
   → New architecture accommodated 2x data volume with unprecedented scalability & stability

•  Present: We are planning to add more features
   → Considering moving some part of aggregation into Hadoop

Proof:


Page 15: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

•  Process ONLY the required data
•  Compress your data
•  "Divide & Conquer" your data (i.e. slice and dice)

Lessons Learned & Best Practices

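The three practices above can be illustrated together in a small sketch. The hourly file layout here is a hypothetical example, not MediaMath's actual storage scheme:

```python
import gzip
import json
import os
import tempfile

base = tempfile.mkdtemp()

def write_hourly_slice(hour, records):
    # "Divide & conquer": one slice per hour, compressed on write.
    path = os.path.join(base, f"hour={hour:02d}.json.gz")
    with gzip.open(path, "wt") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_hour(hour):
    # Process ONLY the required data: open just the requested hour's slice,
    # leaving every other hour untouched on disk.
    path = os.path.join(base, f"hour={hour:02d}.json.gz")
    with gzip.open(path, "rt") as f:
        return [json.loads(line) for line in f]

write_hourly_slice(13, [{"user": "u1", "campaign": "campA"}])
print(read_hour(13))  # [{'user': 'u1', 'campaign': 'campA'}]
```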

Page 16: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

THANK YOU