Post on 18-Jan-2017
Stream Processing in SmartNews
Takumi Sakamoto 2016.03.12
Takumi Sakamoto @takus 😍 = ⚽ ✈ 📷
AWS Case Study
http://aws.amazon.com/solutions/case-studies/smartnews/
What is SmartNews?
• News Discovery App for Mobile
• Launched in 2012
• 15M+ downloads worldwide
https://www.smartnews.com/en/
How Do We Deliver News?
Internet → Algorithms → Trending News
Why Stream Processing?
Today’s News Is Wrapping Tomorrow’s Fish and Chips
↑ Yesterday's News
http://www.personalchefapproach.com/tomorrows-fish-n-chips-wrapper/
Lifetime of News Articles
https://gdsdata.blog.gov.uk/2013/10/22/the-half-life-of-news/
Speed Matters to Us
System Overview
News Delivery Pipeline
[Diagram] Components: Internet, Crawler, Analyzer, Indexer, CloudSearch, API Search, API Gateway, Mobile App, API Tracker, DynamoDB
Groups: Index System, Feedback System
Latency labels: 1 minute, 5 minutes
Index System
• Crawler
  • collect news articles & social signals
• Analyzer
  • extract title, content, thumbnail...
  • classify topics (sports, politics, technology...)
• Indexer
  • upload article metadata into CloudSearch
Feedback System
• API Tracker
  • receive users' activity logs from the mobile app
• Spark Streaming
  • generate various metrics for news ranking
  • store metrics in DynamoDB
How Do We Glue the Services Together?
Ref: Amazon Kinesis: Real-time Streaming Big data Processing Applications
Why Kinesis Streams?
• Fully managed service
• Multiple consumer applications
• Reasonable pricing
Multiple Consumers
[Diagram] Kinesis Stream → Spark on EMR, AWS Lambda
Data Scientist: "I wanna consume streaming data with Spark"
Application Engineer: "I wanna add a streaming monitor with Lambda"
Empowers Engineers to Do Trial and Error
News Delivery Pipeline
[Diagram] Components: Internet, Crawler, Analyzer, Indexer, CloudSearch, API Search, API Gateway, Mobile App, API Tracker, DynamoDB, with three Kinesis Streams between services
Data & Its Numbers
• User activities
  • ~100 GB per day (compressed)
  • 60+ record types
• User demographics, configurations, etc.
  • 15M+ records
• Article metadata
  • 100K+ records per day
How Do We Produce to and Consume from Kinesis Streams?
Index System
[Diagram] Crawler (KPL ×3) → Analyzer (KCL ×3, KPL ×3) → Indexer (KCL ×3) → CloudSearch
Collect, Analyze and Index Articles with Kinesis Libraries (KPL & KCL)
Kinesis Libraries
• Kinesis Producer Library (KPL)
  • put records into a stream
  • asynchronous architecture (buffers records)
• Kinesis Client Library (KCL)
  • consume and process data from a stream
  • handle complex tasks associated with distributed computing
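The buffering idea behind the KPL can be illustrated with a small sketch. This is plain Python, not the KPL API: `BufferedProducer`, `max_batch`, and the `send_batch` callback are hypothetical names chosen for illustration, and the real library also handles retries, aggregation, and background flushing.

```python
from typing import Callable, List


class BufferedProducer:
    """Toy producer that buffers records and sends them in batches.

    Hypothetical illustration only; the real KPL is a separate library
    with its own API.
    """

    def __init__(self, send_batch: Callable[[List[bytes]], None], max_batch: int = 3):
        self._send_batch = send_batch
        self._max_batch = max_batch
        self._buffer: List[bytes] = []

    def put(self, record: bytes) -> None:
        # Buffer the record; only call the backend once the batch is full.
        self._buffer.append(record)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        # Send whatever is buffered as one batch, then clear the buffer.
        if self._buffer:
            self._send_batch(list(self._buffer))
            self._buffer.clear()


sent_batches: List[List[bytes]] = []
producer = BufferedProducer(sent_batches.append, max_batch=3)
for i in range(7):
    producer.put(f"article-{i}".encode())
producer.flush()  # push the final partial batch
```

Batching trades a little latency for far fewer calls to the stream, which is the usual reason a producer buffers records.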
KPL/KCL Monitoring
• KPL/KCL publish custom CloudWatch metrics
• Key Metrics for KPL
  • UserRecordsReceived, UserRecordsPending
  • AllErrors
• Key Metrics for KCL
  • RecordsProcessed
  • MillisBehindLatest
  • RecordProcessor.processRecords.Time
https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kpl.html
https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kcl.html
Monitoring with Datadog
Feedback System
Generate Metrics by User Clusters for Ranking Articles
[Diagram] Components: API Search, API Gateway, API Tracker, Kinesis Stream, Amazon S3, Hive / Spark (Offline ETL / Machine Learning), Amazon CloudSearch, DynamoDB, Push Notification
Data labels: User Feedback, User Clusters, Article Metadata, Metrics by Cluster
Why Metrics by Cluster?
Consider Each User's Interests
Ensure Diversity to Avoid a Filter Bubble
https://en.wikipedia.org/wiki/Filter_bubble
[Diagram] A user sends GET /news/sports to the API, which combines the Article Inventory (Amazon CloudSearch) with Metrics by User Cluster (DynamoDB).
User: userId: 1000, gender: Male, age: 36, location: San Francisco, US, interests: Baseball

Article                      raw score   weight   score
San Francisco Giants …       3.5         1        3.5
New York Yankees …           6.2         0.6      3.72
FIFA World Cup …             20.4        0.2      4.08
U.S. Open Championships …    8.4         0.2      1.68
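The weight and score columns in the slide suggest that each article's score is its raw score multiplied by a per-user weight (for example, 20.4 × 0.2 = 4.08). A minimal sketch of that arithmetic follows. The function name `personalized_scores` is hypothetical, and how the weights are derived from the user's cluster is not shown in the slides.

```python
# Rows from the slide: (article topic, raw score, weight for this user).
articles = [
    ("San Francisco Giants", 3.5, 1.0),
    ("New York Yankees", 6.2, 0.6),
    ("FIFA World Cup", 20.4, 0.2),
    ("U.S. Open Championships", 8.4, 0.2),
]


def personalized_scores(rows):
    """Return (title, score) pairs sorted by score, highest first."""
    scored = [(title, round(raw * weight, 2)) for title, raw, weight in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = personalized_scores(articles)
for title, score in ranking:
    print(f"{title}: {score}")
```

With these numbers the globally popular article still ranks first, but the gap to the baseball articles shrinks, which is the balance between personal interest and diversity that the slides describe.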
Input Data by Fluentd
• Forwarder (running on each instance)
  • archive events to S3
  • forward events to aggregators
• Aggregator (HA configuration※)
  • put events into Kinesis Stream
  • alert and report (not mentioned here)
※ http://docs.fluentd.org/articles/high-availability
Example Configurations
Forwarder
<source>
  @type tail
  tag smartnews.user_activity
  ...
</source>
<match smartnews.user_activity>
  @type copy
  <store>
    @type s3
    ...
  </store>
  <store>
    @type forward
    ...
  </store>
</match>

Aggregator
<source>
  @type forward
  ...
</source>
<match smartnews.user_activity>
  @type copy
  <store>
    @type kinesis
    ...
  </store>
  <store>
    ...
  </store>
</match>
http://docs.fluentd.org/articles/kinesis-stream
Offline ETL Flow
Transform Text Files into Columnar Files
Various Machine Learning Tasks
[Diagram] Components: API, RDS, Amazon S3, Hive on EMR, Amazon S3, Spark on EMR, Airflow (Manage Workflow)

Activities
{ "timestamp": 1453161447, "userId": 1234, "platform": "ios", "edition": "ja_JP", "action": "viewArticle", "data": { "articleId": 1234, "duration": 30.2 }}

Users
userId, age, gender, location, …
1234, 28, M, Tokyo, …
1235, 32, F, Nagano, …
1240, 18, F, Kyoto, …
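Turning nested JSON activity records into flat rows is the first step toward a columnar layout. The sketch below only illustrates that reshaping in plain Python. In the deck the actual transformation runs in Hive on EMR, and the `flatten` helper and the underscore-joined column names are assumptions made for this example.

```python
import json

# The sample activity record from the slide.
raw_event = """{"timestamp": 1453161447, "userId": 1234, "platform": "ios",
 "edition": "ja_JP", "action": "viewArticle",
 "data": {"articleId": 1234, "duration": 30.2}}"""


def flatten(event: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into one level, joining keys with underscores."""
    row = {}
    for key, value in event.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=f"{name}_"))
        else:
            row[name] = value
    return row


row = flatten(json.loads(raw_event))
print(list(row))  # column names of the flattened record
```

Each key of `row` would become one column, so a query that only needs `action` and `data_duration` can skip the rest.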
Airflow: Workflow Engine
Execute Task A -> Task B -> Task C, D
5 * * * * app hive -f query_1.hql
15 * * * * app hive -f query_2.hql
30 * * * * app hive -f query_3.hql
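Cron entries like these start each query at a fixed minute, whether or not the previous query has finished. A workflow engine instead runs tasks in dependency order. The sketch below shows that ordering with Python's standard `graphlib` module. It is not Airflow's API, and the task names and the `run` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key lists the tasks it depends on: B after A, then C and D after B.
dependencies = {
    "task_b": {"task_a"},
    "task_c": {"task_b"},
    "task_d": {"task_b"},
}


def run(task_name: str, log: list) -> None:
    # Stand-in for real work such as running a Hive query.
    log.append(task_name)


executed: list = []
for task in TopologicalSorter(dependencies).static_order():
    run(task, executed)

print(executed)
```

Tasks C and D have no dependency on each other, so an engine is free to run them in either order or in parallel once B completes.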
Spark Streaming
[Diagram] Kinesis Stream (Shard 1, Shard 2, Shard 3) → DStream 1, DStream 2, DStream 3 → minutely RDDs + pre-computed RDD → minutely metrics by user cluster (Female, Male, Teen, ...) → DynamoDB
Split Streams into Minutely RDDs
Join Minutely RDDs with the Pre-Computed RDD
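The split-and-join step can be sketched without Spark. The example below is plain Python rather than the Spark Streaming job from the deck: events are bucketed into one-minute windows, joined with a pre-computed user-to-cluster mapping, and counted per cluster. The event layout, the `user_clusters` mapping, and the `minutely_metrics` function are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical pre-computed mapping of userId -> cluster (the "pre-computed RDD").
user_clusters = {1: "Female", 2: "Male", 3: "Teen"}

# Hypothetical activity events: (epoch seconds, userId, articleId).
events = [
    (1453161447, 1, "a1"),
    (1453161450, 2, "a1"),
    (1453161460, 3, "a2"),
    (1453161507, 1, "a1"),
]


def minutely_metrics(events, user_clusters):
    """Count views per (minute, cluster, article)."""
    counts = Counter()
    for ts, user_id, article_id in events:
        minute = ts - ts % 60                 # start of the one-minute bucket
        cluster = user_clusters.get(user_id)  # join the event with its user cluster
        if cluster is not None:
            counts[(minute, cluster, article_id)] += 1
    return counts


metrics = minutely_metrics(events, user_clusters)
for key, views in sorted(metrics.items()):
    print(key, views)
```

In a streaming job each one-minute bucket would be written out as soon as it closes, which keeps the ranking metrics close to real time.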
Monitor Spark Streaming
Spark UI is Useful for Monitoring
Integrate with CloudWatch
import org.apache.spark.SparkConf
import org.apache.spark.streaming.scheduler._

// Relays Spark Streaming batch events to CloudWatch as custom metrics.
// putMetricToCloudWatch(name, value) is assumed to be defined elsewhere (not shown in the slide).
class CloudWatchRelay(conf: SparkConf) extends StreamingListener {

  override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = {
    putMetricToCloudWatch("BatchStarted", 1.0)
  }

  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    putMetricToCloudWatch("BatchCompleted", 1.0)
    putMetricToCloudWatch("BatchRecordsProcessed", info.numRecords.toDouble)
    // Each delay is an Option[Long] in milliseconds; publish it only when present.
    info.processingDelay.foreach(delay => putMetricToCloudWatch("ProcessingDelay", delay.toDouble))
    info.schedulingDelay.foreach(delay => putMetricToCloudWatch("SchedulingDelay", delay.toDouble))
    info.totalDelay.foreach(delay => putMetricToCloudWatch("TotalDelay", delay.toDouble))
  }
}
Set an Alert on SchedulingDelay
Summary
• Fast & stable stream processing is crucial for SmartNews
  • lifetime of news is very short
  • process events as fast as possible
• Kinesis Stream plays an important role
  • one-click provision & scale
  • empowers engineers to do trial & error
Discuss More?
Join Us for a Free Lunch at Our Tokyo Office!!
We’re hiring!!!
• ML/NLP engineer
• Site reliability engineer
• Web application engineer
• iOS/Android engineer
• Ad engineer
http://about.smartnews.com/en/careers/
See Also
• The Platform Supporting Web Mining at SmartNews
• Integrating Stream Processing and Offline Processing
• Building a Sustainable Data Platform on AWS
• AWS meetup "Apache Spark on EMR"
PipelineDB
• OSS & enterprise streaming SQL database
• PostgreSQL compatible
  • connect to Chartio 😍
  • join streams to normal PostgreSQL tables
• Supports probabilistic data structures
  • e.g. HyperLogLog
https://www.pipelinedb.com/
http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
Realtime Monitoring
[Diagram] Components: API Gateway, Record, PipelineDB (Stream, Continuous Views), Chartio, AWS Lambda, Slack
• Raw records are discarded soon after they are consumed by a continuous view
• Continuous views are incrementally updated in real time
• Continuous views are accessed with a PostgreSQL client
Continuous View
-- Calculate unique users seen per media each day
-- Using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT day(arrival_timestamp),
       substring(url from '.*://([^/]*)') as hostname,
       COUNT(DISTINCT user_id::integer)
FROM activity_stream
GROUP BY day, hostname;

-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
SELECT COUNT(*)
FROM imps_stream;

-- What are the 90th, 95th, 99th percentiles of request latency?
CREATE CONTINUOUS VIEW latency AS
SELECT percentile_cont(array[90, 95, 99])
       WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;
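The "last five minutes" view is a sliding-window count. The sketch below illustrates only those semantics in plain Python. It says nothing about how PipelineDB implements `max_age` internally, and the `SlidingWindowCounter` class is a hypothetical name introduced for this example.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events seen within the last `max_age` seconds."""

    def __init__(self, max_age: float):
        self.max_age = max_age
        self._timestamps = deque()

    def add(self, timestamp: float) -> None:
        # Assumes timestamps arrive in non-decreasing order.
        self._timestamps.append(timestamp)

    def count(self, now: float) -> int:
        # Drop events that have aged out of the window, then count the rest.
        while self._timestamps and self._timestamps[0] <= now - self.max_age:
            self._timestamps.popleft()
        return len(self._timestamps)


imps = SlidingWindowCounter(max_age=300)  # five minutes, in seconds
for t in (0, 100, 250, 400):
    imps.add(t)

print(imps.count(now=400))
```

Only the timestamps inside the window are kept, so memory use is bounded by the event rate times the window length rather than by the full history.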
Dashboard in Chartio
1. Build a query (drag & drop / SQL)
2. Add steps (filter, sort, modify)
3. Select a visualization (table, graph)