Stream Processing in SmartNews #jawsdays

Post on 18-Jan-2017


Stream Processing in SmartNews

Takumi Sakamoto 2016.03.12

Takumi Sakamoto @takus 😍 = ⚽ ✈ 📷

http://bit.ly/1MCOyBX

JAWSDAYS 2015

AWS Case Study

http://aws.amazon.com/solutions/case-studies/smartnews/

What is SmartNews?

• News Discovery App for Mobile

• Launched in 2012

• 15M+ Downloads Worldwide

https://www.smartnews.com/en/

How Do We Deliver News?

Internet → Algorithms → Trending News

Why Stream Processing?

Today’s News Is Tomorrow’s Fish-and-Chips Wrapping

(↑ i.e., it quickly becomes yesterday’s news)

http://www.personalchefapproach.com/tomorrows-fish-n-chips-wrapper/

News Articles Lifetime

https://gdsdata.blog.gov.uk/2013/10/22/the-half-life-of-news/

Speed Matters for Us

System Overview

News Delivery Pipeline

Index System (1 minute): Internet → Crawler → Analyzer → Indexer → CloudSearch → Search API → API Gateway → Mobile App

Feedback System (5 minutes): Mobile App → API Gateway → Tracker API → … → DynamoDB

Index System

• Crawler

• collect news articles & social signals

• Analyzer

• extract title, content, thumbnail...

• classify topics (sports, politics, technology...)

• Indexer

• upload article metadata into CloudSearch

Feedback System

• API Tracker

• receive users' activity logs from the mobile app

• Spark Streaming

• generate various metrics for news ranking

• store metrics in DynamoDB

How Do We Glue Each Service Together?

Ref: Amazon Kinesis: Real-time Streaming Big data Processing Applications

Why Kinesis Streams?

• Fully managed service

• Multiple consumer applications

• Reasonable pricing

Multiple Consumers

Kinesis Stream → Spark on EMR / AWS Lambda

Data Scientist: "I wanna consume streaming data with Spark"

Application Engineer: "I wanna add a streaming monitor with Lambda"

Empowers Engineers to Do Trial and Error

News Delivery Pipeline

Index System: Internet → Crawler → [Kinesis Stream] → Analyzer → [Kinesis Stream] → Indexer → CloudSearch → Search API → API Gateway → Mobile App

Feedback System: Mobile App → API Gateway → Tracker API → [Kinesis Stream] → … → DynamoDB

Data & Its Numbers

• User activities

• ~100 GBs per day (compressed)

• 60+ record types

• User demographics or configurations etc...

• 15M+ records

• Articles metadata

• 100K+ records per day

How Do We Produce/Consume Kinesis Streams?

Index System

Crawler (KPL ×3) → Kinesis Stream → (KCL ×3) Analyzer (KPL ×3) → Kinesis Stream → (KCL ×3) Indexer → CloudSearch

Collect, Analyze and Index Articles with Kinesis Libraries (KPL & KCL)

Kinesis Libraries

• Kinesis Producer Library (KPL)

• put records into a stream

• asynchronous architecture (buffers records)

• Kinesis Client Library (KCL)

• consume and process data from a stream

• handle complex tasks associated with distributed computing
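The KPL's asynchronous buffering described above can be illustrated with a toy Python producer (the real KPL is a Java library backed by a native daemon; the class and parameter names here are hypothetical). Instead of one PutRecords call per record, user records accumulate in a buffer and are flushed in batches, either when the batch is full or when the buffer gets too old:

```python
import time

class BufferingProducer:
    """Toy illustration of KPL-style buffering: accumulate user records
    and flush them in batches rather than one API call per record.
    `put_records_fn` stands in for the Kinesis PutRecords API call."""

    def __init__(self, put_records_fn, max_batch=500, max_age_sec=1.0):
        self.put_records_fn = put_records_fn
        self.max_batch = max_batch
        self.max_age_sec = max_age_sec
        self.buffer = []
        self.last_flush = time.monotonic()

    def put(self, partition_key, data):
        self.buffer.append({"PartitionKey": partition_key, "Data": data})
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.last_flush >= self.max_age_sec):
            self.flush()

    def flush(self):
        if self.buffer:
            self.put_records_fn(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()
```

The real KPL additionally aggregates several user records into one Kinesis record and retries failed puts, which is why its CloudWatch metrics distinguish "user records" from Kinesis records.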

KPL/KCL Monitoring

• KPL/KCL publishes custom CloudWatch metrics

• Key Metrics for KPL

• UserRecordsReceived, UserRecordsPending

• AllErrors

• Key Metrics for KCL

• RecordsProcessed

• MillisBehindLatest

• RecordProcessor.processRecords.Time

https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kpl.html

https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kcl.html

Monitoring with Datadog

Feedback System

Generate Metrics by User Clusters for Ranking Articles

Components: Amazon CloudSearch (Article Metadata), Search API, API Gateway, Tracker API (User Feedback), Kinesis Stream, Amazon S3, Hive / Spark (Offline ETL / Machine Learning), User Clusters, DynamoDB (Metrics by Cluster), Push Notification

Why Metrics by Cluster?

Consider Each User's Interests

Ensure Diversity for Avoiding Filter Bubble

https://en.wikipedia.org/wiki/Filter_bubble

Amazon CloudSearch (Article Inventory) + DynamoDB (Metrics by User Cluster) → API

User (userId: 1000, gender: Male, age: 36, location: San Francisco, US, interests: Baseball) → GET /news/sports

score = raw score × weight:

Article                      raw score   weight   score
San Francisco Giants …          3.5       1.0      3.5
New York Yankees …              6.2       0.6      3.72
FIFA World Cup …               20.4       0.2      4.08
U.S. Open Championships …       8.4       0.2      1.68
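The scoring rule in the table above is a simple product: each article's raw score is multiplied by the weight of the user's interest cluster. A minimal sketch (the function name is hypothetical):

```python
def personalized_score(raw_score, weight):
    """Per-user article score, as in the table above:
    score = raw score x user-cluster weight."""
    return raw_score * weight

# e.g. "FIFA World Cup": raw score 20.4, weight 0.2 -> 4.08
```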

Input Data by Fluentd

• Forwarder (running on each instance)

• archive events to S3

• forward events to aggregators

• Aggregator (HA Configuration※)

• put events into Kinesis Stream

• alert and report (not mentioned here)

※ http://docs.fluentd.org/articles/high-availability

Example Configurations

Forwarder:

<source>
  @type tail
  tag smartnews.user_activity
  ...
</source>

<match smartnews.user_activity>
  @type copy
  <store>
    @type s3
    ...
  </store>
  <store>
    @type forward
    ...
  </store>
</match>

Aggregator:

<source>
  @type forward
  ...
</source>

<match smartnews.user_activity>
  @type copy
  <store>
    @type kinesis
    ...
  </store>
  <store>
    ...
  </store>
</match>

http://docs.fluentd.org/articles/kinesis-stream

Offline ETL Flow

Transform text files into columnar files; run various machine learning tasks

API

RDS

{
  "timestamp": 1453161447,
  "userId": 1234,
  "platform": "ios",
  "edition": "ja_JP",
  "action": "viewArticle",
  "data": {
    "articleId": 1234,
    "duration": 30.2
  }
}

userId, age, gender, location, …
1234, 28, M, Tokyo, …
1235, 32, F, Nagano, …
1240, 18, F, Kyoto, …
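The ETL step that turns raw JSON activity logs like the sample above into flat, columnar-friendly rows can be sketched as follows (the talk's pipeline does this with Hive/Spark on EMR; the function name and the `data_` column-naming convention are assumptions):

```python
import json

def flatten_activity(line):
    """Flatten one JSON activity-log line into a flat dict whose keys
    can become columns in a columnar file (e.g. ORC/Parquet)."""
    event = json.loads(line)
    row = {k: v for k, v in event.items() if k != "data"}
    # promote nested "data" fields to top-level columns
    for k, v in event.get("data", {}).items():
        row[f"data_{k}"] = v
    return row
```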

Amazon S3 (Activities, Users) → Hive on EMR → Amazon S3 → Spark on EMR

API / RDS as additional sources; Airflow manages the workflow

Airflow: Workflow Engine

Execute Task A -> Task B -> Task C, D

5 * * * * app hive -f query_1.hql

15 * * * * app hive -f query_2.hql

30 * * * * app hive -f query_3.hql
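The cron entries above chain tasks only implicitly, by spacing their start times; a workflow engine like Airflow instead runs Task A → Task B → Tasks C, D from an explicit dependency graph. A toy sketch of that ordering in plain Python (not actual Airflow code), using the standard library's topological sorter:

```python
from graphlib import TopologicalSorter

# Task A -> Task B -> Tasks C, D (C and D both depend on B)
deps = {"B": {"A"}, "C": {"B"}, "D": {"B"}}

def run_order(deps):
    """Return one valid execution order respecting the dependency graph
    (each task maps to the set of tasks it depends on)."""
    return list(TopologicalSorter(deps).static_order())
```

With a real workflow engine, a failed task also blocks only its downstream tasks and can be retried in place, which fixed-offset cron schedules cannot express.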

Spark Streaming

Kinesis Stream (Shard 1 / Shard 2 / Shard 3) → DStream 1 / DStream 2 / DStream 3 → RDDs → union → Minutely RDD

Minutely RDD + Pre-Computed RDD (user clusters: Female / Male / Teen) → Minutely Metrics by User Cluster → DynamoDB

Split Streams into Minutely RDD

Join Minutely RDD with Pre-Computed RDD
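The two steps above (split the stream into minutely batches, then join against the pre-computed user-cluster data) can be sketched in plain Python; the talk does this with Spark Streaming, and the names here are hypothetical:

```python
from collections import Counter

def minutely_metrics(events, user_clusters):
    """events: iterable of (user_id, article_id) pairs from one minutely batch.
    user_clusters: pre-computed dict user_id -> cluster (e.g. "male", "teen").
    Returns event counts keyed by (cluster, article_id), the shape of the
    per-cluster metrics stored in DynamoDB."""
    counts = Counter()
    for user_id, article_id in events:
        cluster = user_clusters.get(user_id)
        if cluster is not None:  # drop events from users without a cluster
            counts[(cluster, article_id)] += 1
    return counts
```

In Spark terms this corresponds to joining the minutely RDD with the pre-computed RDD on user ID, then reducing by (cluster, article) key.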

Monitor Spark Streaming

Spark UI is Useful for Monitoring

Integrate with CloudWatch

class CloudWatchRelay(conf: SparkConf) extends StreamingListener {
  override def onBatchStarted(batchStarted: StreamingListenerBatchStarted) {
    putMetricToCloudWatch(s"BatchStarted", 1.0)
  }

  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
    putMetricToCloudWatch(s"BatchCompleted", 1.0)
    putMetricToCloudWatch(s"BatchRecordsProcessed", batchCompleted.batchInfo.numRecords.toDouble)
    batchCompleted.batchInfo.processingDelay.foreach { delay =>
      putMetricToCloudWatch(s"ProcessingDelay", delay)
    }
    batchCompleted.batchInfo.schedulingDelay.foreach { delay =>
      putMetricToCloudWatch(s"SchedulingDelay", delay)
    }
    batchCompleted.batchInfo.totalDelay.foreach { delay =>
      putMetricToCloudWatch(s"TotalDelay", delay)
    }
  }
}
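The putMetricToCloudWatch helper used by the listener is not shown in the slides. In Python it might look like the following boto3 sketch (the namespace is an assumption; the client is passed in so the function can be exercised without AWS credentials):

```python
def put_metric_to_cloudwatch(cloudwatch, name, value, namespace="SparkStreaming"):
    """Publish one custom metric datapoint.
    `cloudwatch` is a boto3 CloudWatch client, e.g.
    boto3.client("cloudwatch") in production (injected here for testability)."""
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[{"MetricName": name, "Value": float(value)}],
    )
```

A CloudWatch alarm on the resulting SchedulingDelay metric then gives the alert described on the next slide.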

Set Alert to SchedulingDelay

Summary

• Fast & stable stream processing is crucial for SmartNews

• lifetime of news is very short

• process events as fast as possible

• Kinesis Stream plays an important role

• one-click provision & scale

• empowers engineers to do trial & error

Discuss More?

Join Our Free Lunch at the Tokyo Office!!

We’re hiring!!!

ML/NLP engineer Site reliability engineer Web application engineer iOS/Android engineer Ad engineer

http://about.smartnews.com/en/careers/

See Also

• SmartNews の Webmining を支えるプラットフォーム (the platform supporting SmartNews's web mining)

• Stream 処理と Offline 処理の統合 (integrating stream processing and offline processing)

• Building a Sustainable Data Platform on AWS

• AWS meetup「Apache Spark on EMR」

PipelineDB

• OSS & enterprise streaming SQL database

• PostgreSQL compatible

• connect to Chartio 😍

• join stream to normal PostgreSQL table

• Supports probabilistic data structures

• e.g. HyperLogLog
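HyperLogLog is the structure that lets PipelineDB answer COUNT(DISTINCT ...) over an unbounded stream in constant memory. A minimal sketch of the idea (this omits the small-range and large-range corrections that production implementations such as PipelineDB's apply):

```python
import hashlib

class HyperLogLog:
    """Minimal HyperLogLog: estimate the number of distinct items seen,
    using only m = 2^p small registers regardless of stream size."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p                     # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash of the item
        x = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = x >> (64 - self.p)            # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # harmonic mean of 2^register across all registers
        s = sum(2.0 ** -r for r in self.registers)
        return self.alpha * self.m * self.m / s
```

With p = 10 (1024 registers, about 1 KB of state) the standard error is roughly 1.04/√1024 ≈ 3%, which is the constant-space trade-off the slide refers to.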

https://www.pipelinedb.com/ http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/

Realtime Monitoring

API Gateway → Stream → Continuous Views → PipelineDB → Chartio / AWS Lambda → Slack

Raw records are discarded soon after being consumed by a Continuous View

Continuous Views are incrementally updated in real time

Access Continuous Views with any PostgreSQL client

-- Calculate unique users seen per media each day,
-- using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT
  day(arrival_timestamp),
  substring(url from '.*://([^/]*)') AS hostname,
  COUNT(DISTINCT user_id::integer)
FROM activity_stream
GROUP BY day, hostname;

-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
SELECT COUNT(*) FROM imps_stream;

-- What are the 90th, 95th, 99th percentiles of request latency?
CREATE CONTINUOUS VIEW latency AS
SELECT percentile_cont(array[90, 95, 99])
  WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;

Dashboard in Chartio

1. Build a query (Drag & Drop / SQL)

2. Add steps (filter, sort, modify)

3. Select a visualization (table, graph)