Post on 17-Aug-2015
Plan
  Who is MindGeek?
  Data Processing Challenges
  Big Data as a Service
  Lightning: Our data stack
    Goal
    Technologies
    Architecture
    Customizations
  Uses
    Ads Data Processing (including Real-time processing)
    User Profiling
    Anomaly Detection
    Fraud Detection
  Q&A
04/18/2023
Who is MindGeek?
  Founded in 2004
  More than 1,000 employees
  Specialized in high-traffic, high-volume websites
  Active in Content Delivery, Streaming Media and Online Advertising
  Behind several of the biggest, most trafficked sites online
    100M+ daily visitors
    More than 3 billion ad impressions served every day
  You might have heard of our brands:
Data Processing Challenges
  TrafficJunky.net – our Ad Platform existed for a number of years
    Generates >10TB of structured data per day
    Cannot go down – ever
  RDBMS for everything
    Administration Interface
    Statistics
    Aggregation
  Throwing money at the problem gets old
    Buy bigger servers
    Build more features
    Repeat
  A new approach was needed
    While supporting the legacy system, adding new features and building the new system
Big Data as a Service
  Lots of acquisitions in the past years
    Several different technology stacks/architectures
    Several companies in a company
  Create a company-wide data transport, transformation, aggregation and storage service
    Dedicated Experts
    Developers, Data Scientists and Business Analysts can concentrate on their knowledge domain
    Cost reduction
    Raises the Bus Factor
    No more week-end emails (at least, for me)!
Lightning!
  Goal
    A real-time, highly available, scalable end-to-end solution to transport, process, store and report on any business-critical data
  Technologies used
    Transportation: Kafka, Flume, ZeroMQ, scp
    Processing: Hadoop (YARN), Samza, Tez
    Storage: HDFS
    Reporting: PostgreSQL, Hue, Hive, TJ Analytics, etc.
    Management: ZooKeeper, custom software to monitor, configure and submit jobs
Lightning!
  Samza & Kafka as the core infrastructure
    Stream based
      Aggregate and discard
      Online Algorithms
      Low latency decisions (in the same session)
    Scalable
      Spinning new instances is fast
      If something crashes, we can rewind the feed and start over
    High-Availability
      Fits our problem domain
      Disaster Recovery
        Multiple Data Centers
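
To make "aggregate and discard" with an online algorithm concrete, here is a minimal sketch in Python. It is not taken from the Lightning code base: the class name, the field names and the sample values are hypothetical, and Welford's running mean/variance is only one example of an online algorithm that fits this pattern. The point is that each event updates a small, fixed-size state and can then be dropped.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Online mean/variance (Welford's algorithm); constant-size state."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Fold one observation into the state; the raw value is not stored.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as 0.0 with fewer than two observations.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


stats = RunningStats()
for latency_ms in [12.0, 15.0, 11.0, 14.0]:  # hypothetical event values
    stats.update(latency_ms)  # the event can be discarded after this call
```

Because the state is three numbers per key, it can be rebuilt by replaying the feed from an earlier offset after a crash, which matches the "rewind the feed and start over" point above.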
Lightning – Customization
  Samza Worker Upload/Configuration/Administration
  Cluster/Job Monitoring and Debugging
Ad Impressions Data Processing
[Diagram components: Delivery Servers; Kafka Topic; Write to HDFS/Hive; Pre-Aggregation; Aggregation; Data Mart; Hive; In-Memory Database; Live Database; BI Interface: TJ Analytics; HUE. Audiences: Sales, Data Science, Engineers, Deployment]
Detailed processing for the Data Mart
[Diagram components: Delivery Servers; Kafka Topic; Filtering; Event Aggregation; Meta-Aggregation; Meta-data Update; Output Kafka Topic; Data Mart; Persistence Layer]
  Process function: called at every event
  Window function: called every minute
  Window function: called every 15 minutes
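
The split between a per-event process function and a periodic window function can be sketched as follows. This is plain Python that only mirrors the shape of the pattern; it is not the Samza API, and the class name, the event fields (`type`, `campaign_id`, `zone_id`) and the filtering rule are assumptions made for illustration. In a real job the framework would call the window function on a timer; here it is called by hand.

```python
from collections import Counter


class ImpressionAggregator:
    """Counts impressions per (campaign, zone) between two window calls."""

    def __init__(self) -> None:
        self.counts = Counter()

    def process(self, event: dict) -> None:
        # Called for every event. Filtering step: keep impressions only
        # (hypothetical rule), then fold the event into the counters.
        if event.get("type") != "impression":
            return
        self.counts[(event["campaign_id"], event["zone_id"])] += 1

    def window(self) -> list:
        # Called periodically. Emit one row per key, then discard the state
        # so the next window starts from zero.
        rows = [
            {"campaign_id": c, "zone_id": z, "impressions": n}
            for (c, z), n in sorted(self.counts.items())
        ]
        self.counts.clear()
        return rows


agg = ImpressionAggregator()
for e in [
    {"type": "impression", "campaign_id": 1, "zone_id": 7},
    {"type": "impression", "campaign_id": 1, "zone_id": 7},
    {"type": "click", "campaign_id": 1, "zone_id": 7},
    {"type": "impression", "campaign_id": 2, "zone_id": 7},
]:
    agg.process(e)
rows = agg.window()  # rows that would be sent to the output topic
```

A second, slower window (the 15-minute one on the slide) could follow the same shape with its own state; what it does in Lightning is not described in the deck, so it is left out here.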
User Profiling
[Diagram components: Delivery Servers; Kafka Topic; Partitioning by hashed GUID; Content-based Profiling; Secret Sauce; Output Kafka Topic; Fast Storage; User Profile API]
  Process function: called on every event
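
Partitioning by hashed GUID means every event for a given user lands in the same partition, so a single worker sees that user's whole history. A minimal sketch of the idea, assuming a hypothetical `partition_for` helper and using SHA-256 purely for illustration (the deck does not say which hash is used, and Kafka clients ship their own key hashing):

```python
import hashlib


def partition_for(guid: str, num_partitions: int) -> int:
    """Map a user GUID to a stable partition index in [0, num_partitions)."""
    digest = hashlib.sha256(guid.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo the
    # partition count. The same GUID always yields the same partition.
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical GUID; both events for this user map to the same partition.
first = partition_for("3f2c9a7e-0000-4000-8000-000000000001", 16)
second = partition_for("3f2c9a7e-0000-4000-8000-000000000001", 16)
```

Note that changing the partition count remaps most GUIDs, so any per-user state held by a worker would have to be rebuilt or migrated in that case.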
Anomaly Detection
[Diagram components: Delivery Servers; Kafka Topic; Pre-aggregation; Template building; Template Storage; Event Storage; Correlator; Alerting System; BI Interface: TJ Analytics. Audiences: Sales, Data Science, Engineers]
Fraud Detection
[Diagram components: Paysite Login Pages; HTTPS; Login Page Analysis; Event Aggregator; API; Kafka Topic; Flume; Samza Worker; Knowledge DB; Results DB; Billing Software; PostgreSQL; DynamoDB]