1
Apache S4: A Distributed Stream Computing Platform
Presented at Stanford Infolab – Nov 4, 2011
http://incubator.apache.org/projects/s4 (migrating from http://s4.io)
S4 Committers: {fpj, kishoreg, leoneu, mmorel, robbins}@apache.org
Presented by Leo Neumeyer (@leoneu)
2
About Me
Born in Buenos Aires, Argentina, studied EE.
School/Work in Canada (Signal Processing, Speech Coding).
SRI Int'l (Menlo Park) Speech Lab, DARPA benchmarks, lab founded speech recognition spin-off Nuance Comm Inc.
Mindstech: Startup to teach spoken English in Asia using web audio/video (before 2-way media was widely available).
Yahoo! Labs: Search advertising (optimization, auctions).
Quantbench: mission is to create a marketplace for data scientists, data providers, and investment funds.
3
S4 Project History
Started as a research project at Yahoo! Labs in August 2008 out of the need to personalize search ads in real time.
Open sourced in September 2009.
Moved to Apache Incubator in October 2011.
4
Motivation
Given multiple event streams, extract information using data-driven models, in real time, with low latency, at scale.
Personalized Search
Twitter Trends
Online Parameter Optimization
Predict Market Prices
Automatic Trading
Network Intrusion Detection
Spam Filtering
Sensor Networks
It's Fun!
5
S4 Architecture
S4 is a general-purpose, real-time, distributed, decentralized, robust, scalable, event driven, pluggable platform that allows programmers to easily implement applications for processing continuous unbounded streams of data.
[Figure: architecture diagram with the labels Server, App, PE Prototype, Stream, PE Instance, and Node.]
Apps encapsulate units of work. They can consume and produce event streams.
Unlimited number of nodes. Each node has one process.
There is one server process per node. The server loads/unloads apps.
An app is a graph composed of PE prototypes and streams that produce, consume, and transmit messages.
PE instances are clones of the prototype. They are associated with a unique key and contain the state.
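The prototype/instance relationship described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the S4 API: `CounterPE`, `Stream`, `put`, and `key_field` are hypothetical names chosen for this example. The point it shows is that a stream clones the prototype the first time a key is seen, so each key gets its own instance holding its own state.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterPE:
    """Hypothetical processing element that counts the events for one key."""
    key: Optional[str] = None  # None marks the prototype itself
    count: int = 0

    def on_event(self, event: dict) -> None:
        self.count += 1


class Stream:
    """Hypothetical stream that routes each event to the PE instance owning its key."""

    def __init__(self, prototype: CounterPE, key_field: str) -> None:
        self.prototype = prototype
        self.key_field = key_field
        self.instances = {}  # key -> PE instance cloned from the prototype

    def put(self, event: dict) -> None:
        key = event[self.key_field]
        pe = self.instances.get(key)
        if pe is None:
            # First event for this key: clone the prototype and bind the key.
            pe = copy.deepcopy(self.prototype)
            pe.key = key
            self.instances[key] = pe
        pe.on_event(event)


stream = Stream(CounterPE(), key_field="user")
for user in ["ann", "bob", "ann"]:
    stream.put({"user": user})
print({key: pe.count for key, pe in stream.instances.items()})
```

The prototype never receives events itself; it only serves as the template from which keyed instances are created.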
6
Latency vs. Accuracy
Zero Errors
➔ Latency: Unconstrained
➔ Why? Reproducible results
➔ Use: Debug, Train Models
Real-Time
➔ Latency: Constrained
➔ Why? Limited control over inbound data rate and computing complexity
➔ Use: Process unstructured data, Tolerance to small errors, Graceful recovery from inbound data streams
7
Design
Actors programming model.
Probabilistic thinking in both algorithms and systems.
Run on commodity hardware.
All in-memory, no disk bottlenecks.
Pluggable (protocols, applications, serialization, etc.).
Object-oriented design → POJOs.
Static typing, no string literals, minimize type casting.
Science friendly → constant change, ease of use.
8
Programming Model
Example: estimate click-through rate in a web application after applying a filter to remove bot traffic.
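As a rough illustration of this example, the sketch below filters out bot events and then keeps per-ad impression and click counts. It is plain Python rather than S4 code, and everything in it is an assumption made for the example: the event fields (`agent`, `ad`, `type`), the `BOT_AGENTS` rule, and the `CtrEstimator` class.

```python
from collections import defaultdict

# Assumed bot rule for illustration only: drop events from these user agents.
BOT_AGENTS = {"crawler", "spider"}


def is_bot(event: dict) -> bool:
    return event["agent"] in BOT_AGENTS


class CtrEstimator:
    """Hypothetical per-ad click-through rate estimator fed by filtered events."""

    def __init__(self) -> None:
        self.impressions = defaultdict(int)
        self.clicks = defaultdict(int)

    def on_event(self, event: dict) -> None:
        if is_bot(event):
            return  # filter step: bot traffic never reaches the counters
        if event["type"] == "impression":
            self.impressions[event["ad"]] += 1
        elif event["type"] == "click":
            self.clicks[event["ad"]] += 1

    def ctr(self, ad: str) -> float:
        shown = self.impressions.get(ad, 0)
        return self.clicks.get(ad, 0) / shown if shown else 0.0


estimator = CtrEstimator()
events = (
    [{"agent": "browser", "ad": "a", "type": "impression"}] * 4
    + [{"agent": "browser", "ad": "a", "type": "click"}]
    + [{"agent": "crawler", "ad": "a", "type": "click"}]
)
for event in events:
    estimator.on_event(event)
print(estimator.ctr("a"))
```

In a keyed design like the one described on the architecture slide, the counters for each ad would live in a separate PE instance; a single object holds them here only to keep the sketch short.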
9
Coding an App
10
Research Areas: Systems
Checkpointing strategies
Replication strategies
Dynamic load balancing
Adaptive load management
Query languages
11
Fault Tolerance
Problem: High Availability
➔ Approaches: Warm/hot failover, Cold failover
➔ S4: Warm failover, Standby nodes + Apache ZooKeeper
Problem: State Loss (crashes, system updates)
➔ Approaches: Lossy checkpointing, Lossless checkpointing
➔ S4: Lossy checkpointing
Problem: Low Latency
➔ Approaches: Decouple stream processing from checkpointing
➔ S4: Asynchronous writes, Uncoordinated checkpointing
Approach: checkpoints are count- or time-based; the backend is pluggable to support any data store; PE restore is lazy; tuning is application dependent.
Research by M. Morel, F. Junqueira, Yahoo! Research Europe, 2011.
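A minimal sketch of count-based, lossy checkpointing with a pluggable backend and lazy restore follows. It is not the S4 implementation: `DictBackend`, `CheckpointedCounters`, and the `every` parameter are hypothetical, and the backend write is synchronous here for brevity, whereas the slide describes asynchronous writes.

```python
import pickle


class DictBackend:
    """Hypothetical pluggable backend; any store offering save/load would do."""

    def __init__(self) -> None:
        self._blobs = {}

    def save(self, key: str, blob: bytes) -> None:
        self._blobs[key] = blob

    def load(self, key: str):
        return self._blobs.get(key)


class CheckpointedCounters:
    """Keyed counters that checkpoint each key after every `every` events."""

    def __init__(self, backend, every: int = 3) -> None:
        self.backend = backend
        self.every = every
        self.state = {}       # key -> in-memory count
        self.since_save = {}  # key -> events since the last checkpoint

    def on_event(self, key: str) -> None:
        if key not in self.state:
            # Lazy restore: state is fetched only when the key is first needed.
            blob = self.backend.load(key)
            self.state[key] = pickle.loads(blob) if blob is not None else 0
            self.since_save[key] = 0
        self.state[key] += 1
        self.since_save[key] += 1
        if self.since_save[key] >= self.every:
            # Count-based trigger. Written synchronously here (an assumption).
            self.backend.save(key, pickle.dumps(self.state[key]))
            self.since_save[key] = 0


backend = DictBackend()
node = CheckpointedCounters(backend, every=3)
for _ in range(5):
    node.on_event("word")

# Simulated crash: a fresh node restores the last checkpoint (taken at 3 events).
standby = CheckpointedCounters(backend, every=3)
standby.on_event("word")
print(standby.state["word"])
```

The two events seen after the last checkpoint are lost on restart, which is what makes the scheme lossy; a smaller `every` reduces the loss at the cost of more writes.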
12
Resilience in a Distributed Word Count Task
13
Research Areas: Algorithms
Self-adaptive models: adaptive language models using small amounts of data.
Personalization: learn from user feedback (clicks, location, behavior) to deliver relevant information in real time.
Trend detection: find personal Twitter trends relevant to you.
Intrusion detection: summarize high-level state of the network and detect unusual patterns.
Sensor networks: large amounts of audio/video and other sources require processing, recognition, detection, and tracking. Detect events across sensors.
14
Personalized Search Ads
Goal is to maximize: revenue, click yield, user experience.
By controlling: ranking, pricing, filtering, placement.
S. Schroedl, A. Kesari, and L. Neumeyer, “Personalized ad placement in web search,” in ADKDD ’10: Proceedings of the 4th Annual International Workshop on Data Mining and Audience Intelligence for Online Advertising, 2010.
15
Personalized Search Ads
Model ad click intent using recent user activity.
More likely to click → show more North ads.
Example 1
First query is "digital slr camera".
Next query is "canon slr".
More likely than average to click another ad.
Example 2
Repeated query without previous clicks.
Less likely to click another ad.
16
Personalized Search Ads
Modeling user session
Typical features:
Number of searches/clicks by user in the past 24 hrs
User COPC: ratio of observed clicks to predicted clicks
Identical query searched before / clicked before
Time (seconds) since last search/click
Similarity measures: current vs. previous queries
Modeling technique: stochastic gradient-boosted decision trees (GBDT)
17
Personalized Search Ads
Target: P[CLICK | ad, query, user]
Approximation: P[CLICK | ad, query] * ucp[user, session]
P[CLICK | ad, query]: non-personalized long-term model, computed using Hadoop
ucp[user, session]: User Click Propensity (UCP) for the user session, computed using S4
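The approximation is a product of a long-term estimate and a per-session multiplier, which a few lines of Python can make concrete. The function name, the input checks, and the clamp to 1.0 are assumptions added for this sketch; the slides only state the product itself.

```python
def personalized_click_probability(base_probability: float, ucp: float) -> float:
    """Scale the non-personalized P[CLICK | ad, query] by the session's UCP.

    The validation and the clamp to 1.0 are assumptions of this sketch.
    """
    if not 0.0 <= base_probability <= 1.0:
        raise ValueError("base_probability must be in [0, 1]")
    if ucp < 0.0:
        raise ValueError("ucp must be non-negative")
    return min(1.0, base_probability * ucp)


# Hypothetical numbers: a session judged 1.5x as likely as average to click.
print(personalized_click_probability(0.04, 1.5))
```

A UCP above 1 raises the estimate for sessions that look more likely to click, and a UCP below 1 lowers it, while the long-term model stays unchanged.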
18
Personalized Search Ads
Results:
We can reduce the average number of ads (ad footprint) by 7% without decreasing click yield and revenue.
- OR -
For a given ad footprint we can increase click yield by ~2%.
19
Thank you!
Join the Apache S4 project:
s4-user-subscribe@incubator.apache.org
s4-dev-subscribe@incubator.apache.org