Christopher Smith
● VP of Engineering, Data Science
● FanGraph - Single view of the fan
▪Massive Traffic Spikes
▪Real Time Data at Scale
▪Separating Fans from Bots
Data Science?
Doing it live
Data Sets
• Engagement
• Entitlements
• Access Scans
• 10 Inventory Avails
• 20 Inventory Avails
• Payment
• Member Account
• In Venue
• Social
• Event Meta
• Settlement
• BC
• Distributed Commerce
• Recos
• Abuse Prevention
• Resale
• Fraud
• Marketing & Sponsorship
• Winback
• Customer Service
• Analytics
• Verified Fan
Data From Everywhere… to Everywhere
Core Principles
• Source systems have one job: publish mutations/data
• Consumers don’t impact producers
• Process data one tuple at a time, in an FP-like fashion
• Data is available everywhere
• Data is always archived
• Organize data per project’s needs
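The "one tuple at a time, FP-like" principle can be illustrated with a minimal stdlib-only sketch. It is not the FanGraph code: the class name, the field names (`fanId`, `gate`, `source`) and the `access-scans` tag are all hypothetical. The point is only that each step is a pure function or predicate over a single tuple, so steps can be reused and composed without shared state.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical illustration of per-tuple, side-effect-free processing.
class TupleAtATime {
    // Pure filter: decides on one tuple at a time, no shared state.
    static final Predicate<Map<String, Object>> HAS_FAN_ID =
        t -> t.get("fanId") != null;

    // Pure function: returns a new tuple rather than mutating its input.
    static final Function<Map<String, Object>, Map<String, Object>> TAG_SOURCE =
        t -> {
            Map<String, Object> out = new HashMap<>(t);
            out.put("source", "access-scans"); // assumed tag value
            return out;
        };

    // Compose the steps; each tuple flows through independently.
    static List<Map<String, Object>> process(List<Map<String, Object>> tuples) {
        return tuples.stream()
            .filter(HAS_FAN_ID)
            .map(TAG_SOURCE)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Map<String, Object>> in = List.of(
            Map.of("fanId", "f1", "gate", "A"),
            Map.of("gate", "B")); // no fanId, so it is filtered out
        System.out.println(process(in).size());
    }
}
```

Because neither step touches anything outside its argument, the same predicate and function could in principle be wrapped in a stream-processing framework's filter and function types without change.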
Core Components
• Kafka: reliable, scalable data transit – one cluster per DC
• CKAN: data discovery
• Avro: efficient, extensible serialization w/schema
• Secor: archiving
• Storm/Trident: complex lambda processing
• Vowpal Wabbit: fast, online machine learning
• Elasticsearch: flexible document search
Secor
• Highly scalable
• Aggregate Kafka topic data by time
• Flexible output formats (sequence files, Parquet, CSV)
• Compression
• Store in S3 (cheap)
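To make "aggregate Kafka topic data by time" concrete, here is a small sketch of how a record's timestamp might be mapped to an hourly storage prefix. The key layout (`dt=`/`hr=` folders, then partition and first offset) and all names are assumptions for illustration; this is not Secor's actual path format or API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical time-bucketing helper; layout is assumed, not Secor's.
class TimeBucket {
    // Hourly bucket in UTC, e.g. dt=2017-05-01/hr=13
    private static final DateTimeFormatter HOUR =
        DateTimeFormatter.ofPattern("'dt='yyyy-MM-dd/'hr='HH").withZone(ZoneOffset.UTC);

    // Builds <topic>/dt=YYYY-MM-DD/hr=HH/<partition>_<firstOffset>
    static String objectKey(String topic, int partition, long firstOffset, long timestampMillis) {
        String bucket = HOUR.format(Instant.ofEpochMilli(timestampMillis));
        return String.format("%s/%s/%d_%d", topic, bucket, partition, firstOffset);
    }

    public static void main(String[] args) {
        // Records from the same hour share a prefix, so they can be archived together.
        System.out.println(objectKey("access-scans", 3, 42L, System.currentTimeMillis()));
    }
}
```

Grouping by a time-derived prefix keeps each archived object bounded in size and lets downstream jobs read only the hours they need.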
Storm/Trident
public class MyFunction extends BaseFunction {
    @Override
    public void execute(TridentTuple tuple, TridentCollector collector) { … }
}

public class MyFilter extends BaseFilter {
    @Override
    public boolean isKeep(TridentTuple tuple) { … }
}
Vowpal Wabbit
• Online/out-of-core learning
• Extraordinarily fast – handles large sparse feature spaces
• Broad selection of algorithms
• Scalable (AllReduce)
• Active & reinforcement learning
• Contextual bandit FTW!
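The combination of online learning and large sparse feature spaces can be sketched with a tiny logistic-regression learner that updates on one example at a time and uses the hashing trick to index weights. This is a stdlib-only illustration of the general idea, not Vowpal Wabbit's implementation or API; the class name, learning rate, and the bot-versus-fan feature strings are all hypothetical.

```java
// Hypothetical online learner: one SGD step per example, hashed sparse features.
class OnlineLogistic {
    private final double[] w;   // weight table of size 2^bits
    private final double lr;    // learning rate (assumed constant)

    OnlineLogistic(int bits, double lr) {
        this.w = new double[1 << bits];
        this.lr = lr;
    }

    // Hashing trick: map a feature name to a weight slot without a dictionary.
    private int slot(String feature) {
        return (feature.hashCode() & 0x7fffffff) % w.length;
    }

    // Probability that the example belongs to class 1 (features are binary).
    double predict(String[] features) {
        double z = 0.0;
        for (String f : features) z += w[slot(f)];
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // One gradient step on a single example; label is 0 or 1.
    void learn(String[] features, int label) {
        double g = predict(features) - label;
        for (String f : features) w[slot(f)] -= lr * g;
    }

    public static void main(String[] args) {
        OnlineLogistic m = new OnlineLogistic(18, 0.1);
        for (int i = 0; i < 200; i++) {
            m.learn(new String[] {"ua=headless", "rate=high"}, 1); // assumed bot-like
            m.learn(new String[] {"ua=mobile", "rate=low"}, 0);    // assumed fan-like
        }
        System.out.println(m.predict(new String[] {"ua=headless", "rate=high"}));
        System.out.println(m.predict(new String[] {"ua=mobile", "rate=low"}));
    }
}
```

The model never holds the data set in memory: each example is seen once, applied, and discarded, which is what makes this style of learner a fit for streaming input.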
Elasticsearch
• Generic, sharded & redundant search engine
• Real-time indexing
• Distinct master, data, ingest, and coordinating node roles
• Handles semi-structured data
• Kibana: secret weapon
Lambda Architecture FTW!
• Reusable functions
• Dev team autonomy/loose coupling
• Simplifies handling failure
• Simplifies scaling
• Complex computation with low latency
• Even less computationally intensive tasks become easier