Stream Processing in SmartNews #jawsdays

Post on 18-Jan-2017


Stream Processing in SmartNews

Takumi Sakamoto 2016.03.12

Takumi Sakamoto @takus 😍 = ⚽ ✈ 📷

http://bit.ly/1MCOyBX

JAWSDAYS 2015

AWS Case Study

http://aws.amazon.com/solutions/case-studies/smartnews/

What is SmartNews?

• News Discovery App for Mobile

• Launched in 2012

• 15M+ Downloads Worldwide

https://www.smartnews.com/en/

How Do We Deliver News?

Internet → Algorithms → Trending News

Why Stream Processing?

Today’s News Is Tomorrow’s Fish-and-Chips Wrapping

(↑ i.e., it quickly becomes yesterday’s news)

http://www.personalchefapproach.com/tomorrows-fish-n-chips-wrapper/

News Articles Lifetime

https://gdsdata.blog.gov.uk/2013/10/22/the-half-life-of-news/

Speed Matters for Us

System Overview

News Delivery Pipeline

Index System (1 minute): Internet → Crawler → Analyzer → Indexer → CloudSearch → Search API → API Gateway → Mobile App

Feedback System (5 minutes): Mobile App → API Gateway → Tracker API → … → DynamoDB

Index System

• Crawler

• collect news articles & social signals

• Analyzer

• extract title, content, thumbnail...

• classify topics (sports, politics, technology...)

• Indexer

• upload article metadata into CloudSearch

Feedback System

• API Tracker

• receive users' activity logs from the mobile app

• Spark Streaming

• generate various metrics for news ranking

• store metrics in DynamoDB

How Do We Glue Each Service Together?

Ref: Amazon Kinesis: Real-time Streaming Big data Processing Applications

Why Kinesis Streams?

• Fully managed service

• Multiple consumer applications

• Reasonable pricing

Multiple Consumers

Kinesis Stream → Spark on EMR / AWS Lambda

Data Scientist: "I wanna consume streaming data with Spark"

Application Engineer: "I wanna add a streaming monitor with Lambda"

Empowers Engineers to Do Trial and Error

News Delivery Pipeline

Index System: Internet → Crawler → [Kinesis Stream] → Analyzer → [Kinesis Stream] → Indexer → CloudSearch → Search API → API Gateway → Mobile App

Feedback System: Mobile App → API Gateway → Tracker API → [Kinesis Stream] → … → DynamoDB

Data & Its Numbers

• User activities

• ~100 GBs per day (compressed)

• 60+ record types

• User demographics or configurations etc...

• 15M+ records

• Articles metadata

• 100K+ records per day

How Do We Produce/Consume Kinesis Streams?

Index System

Crawler (KPL ×3) → Kinesis Stream → (KCL ×3) Analyzer (KPL ×3) → Kinesis Stream → (KCL ×3) Indexer → CloudSearch

Collect, Analyze and Index Articles with Kinesis Libraries (KPL & KCL)

Kinesis Libraries

• Kinesis Producer Library (KPL)

• put records into a stream

• asynchronous architecture (buffers records)

• Kinesis Client Library (KCL)

• consume and process data from a stream

• handle complex tasks associated with distributed computing
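The KPL's asynchronous buffering described above can be illustrated with a toy Python producer (the real KPL is a Java library backed by a native daemon; the class and parameter names here are hypothetical). Instead of one PutRecords call per record, user records accumulate in a buffer and are flushed in batches, either when the batch is full or when the buffer gets too old:

```python
import time

class BufferingProducer:
    """Toy illustration of KPL-style buffering: accumulate user records
    and flush them in batches rather than one API call per record.
    `put_records_fn` stands in for the Kinesis PutRecords API call."""

    def __init__(self, put_records_fn, max_batch=500, max_age_sec=1.0):
        self.put_records_fn = put_records_fn
        self.max_batch = max_batch
        self.max_age_sec = max_age_sec
        self.buffer = []
        self.last_flush = time.monotonic()

    def put(self, partition_key, data):
        self.buffer.append({"PartitionKey": partition_key, "Data": data})
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.max_age_sec):
            self.flush()

    def flush(self):
        if self.buffer:
            self.put_records_fn(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()
```

The real KPL additionally aggregates several user records into one Kinesis record and retries failed puts, which is why its CloudWatch metrics distinguish "user records" from Kinesis records.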

KPL/KCL Monitoring

• KPL/KCL publishes custom CloudWatch metrics

• Key Metrics for KPL

• UserRecordsReceived, UserRecordsPending

• AllErrors

• Key Metrics for KCL

• RecordsProcessed

• MillisBehindLatest

• RecordProcessor.processRecords.Time

https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kpl.html

https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kcl.html

Monitoring with Datadog

Feedback System

Generate Metrics by User Clusters for Ranking Articles

Components: Amazon CloudSearch (Article Metadata), Search API, API Gateway, Tracker API (User Feedback), Kinesis Stream, Amazon S3, Hive / Spark (Offline ETL / Machine Learning), User Clusters, DynamoDB (Metrics by Cluster), Push Notification

Why Metrics by Cluster?

Consider Each User's Interests

Ensure Diversity for Avoiding Filter Bubble

https://en.wikipedia.org/wiki/Filter_bubble

Amazon CloudSearch (Article Inventory) + DynamoDB (Metrics by User Cluster) → API

User (userId: 1000, gender: Male, age: 36, location: San Francisco, US, interests: Baseball) → GET /news/sports

score = raw score × weight:

Article                      raw score   weight   score
San Francisco Giants …          3.5       1.0      3.5
New York Yankees …              6.2       0.6      3.72
FIFA World Cup …               20.4       0.2      4.08
U.S. Open Championships …       8.4       0.2      1.68
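The scoring rule in the table above is a simple product: each article's raw score is multiplied by the weight of the user's interest cluster. A minimal sketch (the function name is hypothetical):

```python
def personalized_score(raw_score, weight):
    """Per-user article score, as in the table above:
    score = raw score x user-cluster weight."""
    return raw_score * weight

# e.g. "FIFA World Cup": raw score 20.4, weight 0.2 -> 4.08
```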

Input Data by Fluentd

• Forwarder (running on each instance)

• archive events to S3

• forward events to aggregators

• Aggregator (HA Configuration※)

• put events into Kinesis Stream

• alert and report (not mentioned here)

※ http://docs.fluentd.org/articles/high-availability

Example Configurations

Forwarder:

<source>
  @type tail
  tag smartnews.user_activity
  ...
</source>

<match smartnews.user_activity>
  @type copy
  <store>
    @type s3
    ...
  </store>
  <store>
    @type forward
    ...
  </store>
</match>

Aggregator:

<source>
  @type forward
  ...
</source>

<match smartnews.user_activity>
  @type copy
  <store>
    @type kinesis
    ...
  </store>
  <store>
    ...
  </store>
</match>

http://docs.fluentd.org/articles/kinesis-stream

Offline ETL Flow

Transform text files into columnar files; run various machine learning tasks

API

RDS

{
  "timestamp": 1453161447,
  "userId": 1234,
  "platform": "ios",
  "edition": "ja_JP",
  "action": "viewArticle",
  "data": {
    "articleId": 1234,
    "duration": 30.2
  }
}

userId, age, gender, location, …
1234, 28, M, Tokyo, …
1235, 32, F, Nagano, …
1240, 18, F, Kyoto, …
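The ETL step that turns raw JSON activity logs like the sample above into flat, columnar-friendly rows can be sketched as follows (the talk's pipeline does this with Hive/Spark on EMR; the function name and the `data_` column-naming convention are assumptions):

```python
import json

def flatten_activity(line):
    """Flatten one JSON activity-log line into a flat dict whose keys
    can become columns in a columnar file (e.g. ORC/Parquet)."""
    event = json.loads(line)
    row = {k: v for k, v in event.items() if k != "data"}
    # promote nested "data" fields to top-level columns
    for k, v in event.get("data", {}).items():
        row[f"data_{k}"] = v
    return row
```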

Amazon S3 (Activities, Users) → Hive on EMR → Amazon S3 → Spark on EMR

API / RDS as additional sources; Airflow manages the workflow

Airflow: Workflow Engine

Execute Task A -> Task B -> Task C, D

5 * * * * app hive -f query_1.hql

15 * * * * app hive -f query_2.hql

30 * * * * app hive -f query_3.hql
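The cron entries above chain tasks only implicitly, by spacing their start times; a workflow engine like Airflow instead runs Task A → Task B → Tasks C, D from an explicit dependency graph. A toy sketch of that ordering in plain Python (not actual Airflow code), using the standard library's topological sorter:

```python
from graphlib import TopologicalSorter

# Task A -> Task B -> Tasks C, D (C and D both depend on B)
deps = {"B": {"A"}, "C": {"B"}, "D": {"B"}}

def run_order(deps):
    """Return one valid execution order respecting the dependency graph
    (each task maps to the set of tasks it depends on)."""
    return list(TopologicalSorter(deps).static_order())
```

With a real workflow engine, a failed task also blocks only its downstream tasks and can be retried in place, which fixed-offset cron schedules cannot express.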

Spark Streaming

Kinesis Stream (Shard 1 / Shard 2 / Shard 3) → DStream 1 / DStream 2 / DStream 3 → RDDs → union → Minutely RDD

Minutely RDD + Pre-Computed RDD (user clusters: Female / Male / Teen) → Minutely Metrics by User Cluster → DynamoDB

Split Streams into Minutely RDD

Join Minutely RDD with Pre-Computed RDD
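The two steps above (split the stream into minutely batches, then join against the pre-computed user-cluster data) can be sketched in plain Python; the talk does this with Spark Streaming, and the names here are hypothetical:

```python
from collections import Counter

def minutely_metrics(events, user_clusters):
    """events: iterable of (user_id, article_id) pairs from one minutely batch.
    user_clusters: pre-computed dict user_id -> cluster (e.g. "male", "teen").
    Returns event counts keyed by (cluster, article_id), the shape of the
    per-cluster metrics stored in DynamoDB."""
    counts = Counter()
    for user_id, article_id in events:
        cluster = user_clusters.get(user_id)
        if cluster is not None:  # drop events from users without a cluster
            counts[(cluster, article_id)] += 1
    return counts
```

In Spark terms this corresponds to joining the minutely RDD with the pre-computed RDD on user ID, then reducing by (cluster, article) key.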

Monitor Spark Streaming

Spark UI is Useful for Monitoring

Integrate with CloudWatch

class CloudWatchRelay(conf: SparkConf) extends StreamingListener {
  override def onBatchStarted(batchStarted: StreamingListenerBatchStarted) {
    putMetricToCloudWatch(s"BatchStarted", 1.0)
  }

  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
    putMetricToCloudWatch(s"BatchCompleted", 1.0)
    putMetricToCloudWatch(s"BatchRecordsProcessed", batchCompleted.batchInfo.numRecords.toDouble)
    batchCompleted.batchInfo.processingDelay.foreach { delay =>
      putMetricToCloudWatch(s"ProcessingDelay", delay)
    }
    batchCompleted.batchInfo.schedulingDelay.foreach { delay =>
      putMetricToCloudWatch(s"SchedulingDelay", delay)
    }
    batchCompleted.batchInfo.totalDelay.foreach { delay =>
      putMetricToCloudWatch(s"TotalDelay", delay)
    }
  }
}
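The putMetricToCloudWatch helper used by the listener is not shown in the slides. In Python it might look like the following boto3 sketch (the namespace is an assumption; the client is passed in so the function can be exercised without AWS credentials):

```python
def put_metric_to_cloudwatch(cloudwatch, name, value, namespace="SparkStreaming"):
    """Publish one custom metric datapoint.
    `cloudwatch` is a boto3 CloudWatch client, e.g.
    boto3.client("cloudwatch") in production (injected here for testability)."""
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[{"MetricName": name, "Value": float(value)}],
    )
```

A CloudWatch alarm on the resulting SchedulingDelay metric then gives the alert described on the next slide.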

Set Alert to SchedulingDelay

Summary

• Fast & stable stream processing is crucial for SmartNews

• lifetime of news is very short

• process events as fast as possible

• Kinesis Stream plays an important role

• one-click provision & scale

• empowers engineers to do trial & error

Discuss More?

Join Our Free Lunch at the Tokyo Office!!

We’re hiring!!!

ML/NLP engineer Site reliability engineer Web application engineer iOS/Android engineer Ad engineer

http://about.smartnews.com/en/careers/

See Also

• SmartNews の Webmining を支えるプラットフォーム (the platform supporting SmartNews's web mining)

• Stream 処理と Offline 処理の統合 (integrating stream processing and offline processing)

• Building a Sustainable Data Platform on AWS

• AWS meetup「Apache Spark on EMR」

PipelineDB

• OSS & enterprise streaming SQL database

• PostgreSQL compatible

• connect to Chartio 😍

• join stream to normal PostgreSQL table

• Supports probabilistic data structures

• e.g. HyperLogLog
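HyperLogLog is the structure that lets PipelineDB answer COUNT(DISTINCT ...) over an unbounded stream in constant memory. A minimal sketch of the idea (this omits the small-range and large-range corrections that production implementations such as PipelineDB's apply):

```python
import hashlib

class HyperLogLog:
    """Minimal HyperLogLog: estimate the number of distinct items seen,
    using only m = 2^p small registers regardless of stream size."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                     # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash of the item
        x = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = x >> (64 - self.p)            # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # harmonic mean of 2^register across all registers
        s = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / s
```

With p = 10 (1024 registers, about 1 KB of state) the standard error is roughly 1.04/√1024 ≈ 3%, which is the constant-space trade-off the slide refers to.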

https://www.pipelinedb.com/ http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/

Realtime Monitoring

API Gateway → Stream → Continuous Views → PipelineDB → Chartio / AWS Lambda → Slack

Raw records are discarded soon after being consumed by a Continuous View

Continuous Views are incrementally updated in real time

Access Continuous Views with any PostgreSQL client

-- Calculate unique users seen per media each day,
-- using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT
  day(arrival_timestamp),
  substring(url from '.*://([^/]*)') AS hostname,
  COUNT(DISTINCT user_id::integer)
FROM activity_stream
GROUP BY day, hostname;

-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
SELECT COUNT(*) FROM imps_stream;

-- What are the 90th, 95th, 99th percentiles of request latency?
CREATE CONTINUOUS VIEW latency AS
SELECT percentile_cont(array[90, 95, 99])
  WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;

Dashboard in Chartio

1. Build a query (Drag & Drop / SQL)

2. Add steps (filter, sort, modify)

3. Select a visualization (table, graph)