How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala


Description

At MediaMath, we deal with billions of records every day. One of our biggest challenges is hourly reporting of attribution data: the joining of billions of records to millions of events. How did we solve this hourly attribution reporting problem? We will walk through our evaluation, testing, and fine-tuning of a variety of tools, including Netezza, Hive, and Pig, and how we ultimately chose Cloudera's Impala.

Transcript of How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

Page 1: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

How MediaMath Solved a Critical Reporting Problem with Impala

The Cloudera Sessions
June 18, 2014

Ram Narayanan, Senior Director of Database Architecture & Operations

©2014 MEDIAMATH INC.

Page 2: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

Digital Marketing Pioneer
•  Founded in 2007
•  Global technology company
•  Invented first Demand Side Platform (DSP) for online ads
•  Conducts online advertising through real-time bidding & programmatic buying

About MediaMath


Page 3: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

About MediaMath: Overview of Real-Time Bidding

[Diagram: a real-time auction completes in under 30 ms, connecting the Advertiser (Client), the User, and the publisher page (www.cnn.com) where the ad is shown]

Page 4: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

About MediaMath: Overview of Real-Time Bidding

[Diagram: the user who saw the ad on www.cnn.com later purchases on the advertiser's site (www.shoes.com); the sale and spend are recorded in Event Logs]

Page 5: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

•  Ad Opportunities: 80-100 billion per day
   → 1.2 million opportunities per second at peak
•  We bid on 30-40 billion ads per day
•  We serve 1-2 billion ads per day
•  15-20 million events (click, sale, online sign-up) per hour
•  2 TB of data daily (compressed)
   → Note: This only counts our wins. If we count losses, we easily reach PBs.

About MediaMath
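As a quick sanity check on these figures, the per-second rate can be derived from the daily volume (a sketch assuming the slide's upper-bound value of 100 billion opportunities per day):

```python
# Back-of-the-envelope check of the slide's throughput figures.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

opportunities_per_day = 100e9  # upper bound from the slide: 100 billion/day
avg_per_second = opportunities_per_day / SECONDS_PER_DAY

print(f"{avg_per_second:,.0f} opportunities/sec on average")
# roughly 1.16 million/sec, consistent with the quoted 1.2 million/sec peak
```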

Page 6: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

Which ad (impression) led to which action, like a sale or online signup?
•  35-40 billion recorded impressions served every 30 days
•  15-20 million events per hour
•  Need to join events with impressions 2x per hour

→ Find matching records
→ Perform complex sequencing & allocation logic
→ Run aggregations on results
→ Send data to data marts
→ Provide hourly reporting to clients

The Reporting Attribution Problem

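The matching and allocation steps above can be sketched in miniature. This is a hypothetical last-touch example with invented field names and toy data, not MediaMath's actual sequencing and allocation logic:

```python
from collections import defaultdict

# Toy inputs: (user_id, campaign_id, timestamp) and (user_id, event_type, timestamp).
impressions = [
    ("u1", "campA", 100), ("u1", "campB", 200), ("u2", "campA", 150),
]
events = [
    ("u1", "sale", 250), ("u2", "signup", 120),
]

# Index impressions by user so each event only scans that user's records.
by_user = defaultdict(list)
for user, campaign, ts in impressions:
    by_user[user].append((ts, campaign))

credit = defaultdict(int)
for user, event_type, ev_ts in events:
    # Find matching records: impressions shown to this user before the event.
    prior = [(ts, c) for ts, c in by_user[user] if ts < ev_ts]
    if prior:
        _, campaign = max(prior)             # last-touch: most recent impression wins
        credit[(campaign, event_type)] += 1  # aggregation step

print(dict(credit))  # {('campB', 'sale'): 1}
```

At MediaMath's scale this join runs across billions of impressions, which is why a single-node appliance struggled and a distributed engine was needed.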

Page 7: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

Incumbent Architecture: Appliance-based (Netezza)

- Cost: Expensive
- Scale: Non-incremental scalability
- Performance: Reporting lag
- Reporting inflexibility
- Product feature constrained

Page 8: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

To build a data warehouse architecture that could perform hourly reporting of attribution data at scale, while remaining affordable and easy to manage.

Our Goal

Page 9: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

" Scalability  Handle  10-­‐50x  scale  

" Capability    Ability  to  perform  big  data  joins  at  scale  

" Performance  Complete  aggregaNon  in  <60  minutes  

" Cost  effec1ve  Cheaper  than  appliance-­‐based  soluNons      


Evaluation Criteria:

Page 10: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

" Hive  Run  Nme:  Took  5-­‐6  hours  to  complete  Stability:  High    

" Pig  Run  Nme:  Took  4-­‐5  hours  to  complete  Stability:  High  

" Impala  Beta  (0.6)  Run  Nme:  Took  2-­‐3  hours  to  complete  Stability:  Low        

Evaluated  OpNons:  Round  1  

Page 11: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

" Hive:  Post-­‐Tuning  (map  joins,  bucke1ng,  split  size,  etc.)  Run  Nme:  Took  2-­‐3  hours  to  complete  Stability:  High    

" Impala  GPA  (1.0)  (L0  compression,  slicing,  tuning,  hw  upgrade)  Run  Nme:  Took  30  minutes  to  complete  Stability:  High        

Evaluated  OpNons:  Round  2  

Page 12: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

Data Warehouse Architecture 2011

[Diagram: Bid Logs, Pixel Logs, and Metadata flow through ELT into Netezza; Netezza performs Attribution and Aggregation, loading the Reporting Data Marts that serve Reports]

Page 13: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

Data Warehouse Architecture 2011 vs. 2013

[Diagram: the 2011 Netezza pipeline shown alongside the 2013 pipeline, in which Attribution has moved into Hadoop; Bid Logs, Pixel Logs, and Metadata flow through ELT, Hadoop performs Attribution, and Netezza performs Aggregation to load the Reporting Data Marts that serve Reports]

Page 14: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

•  December 2013: Peak season
   → New architecture accommodated 2x data volume with unprecedented scalability & stability

•  Present: We are planning to add more features
   → Considering moving some part of aggregation into Hadoop

Proof:


Page 15: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

•  Process ONLY the required data
•  Compress your data
•  "Divide & Conquer" your data (i.e. slice and dice)

Lessons Learned & Best Practices

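The three practices above can be illustrated together in a small sketch. The hourly file layout here is a hypothetical example, not MediaMath's actual storage scheme:

```python
import gzip
import json
import os
import tempfile

base = tempfile.mkdtemp()

def write_hourly_slice(hour, records):
    # "Divide & conquer": one slice per hour, compressed on write.
    path = os.path.join(base, f"hour={hour:02d}.json.gz")
    with gzip.open(path, "wt") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_hour(hour):
    # Process ONLY the required data: open just the requested hour's slice,
    # leaving every other hour untouched on disk.
    path = os.path.join(base, f"hour={hour:02d}.json.gz")
    with gzip.open(path, "rt") as f:
        return [json.loads(line) for line in f]

write_hourly_slice(13, [{"user": "u1", "campaign": "campA"}])
print(read_hour(13))  # [{'user': 'u1', 'campaign': 'campA'}]
```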

Page 16: How MediaMath Built Faster, Scalable Attribution Reporting with Hadoop-Impala

THANK YOU