Building Efficient and Reliable Crawler System With Sidekiq Enterprise


Has anyone ever written crawlers?

Has anyone ever used cron?

Has anyone ever used Sidekiq?

Gary (Chien-Wei Chu) @icarus4 / @icarus4.chu

Was a C programmer; fell in love with Ruby in 2013

CTO of Statementdog

I play badminton, StarCraft, and Ruby.

• Introduction to Statementdog

• Data behind Statementdog

• Past practice of Statementdog

• Problems of the past practice

• How we design our system to solve the problems

Focus on:

• More reliable job scheduling

• Dealing with throttling issues

The data we track: Revenue, EPS, Gross Margin, Net Income, Assets, Liabilities, Operating Cash Flow, Free Cash Flow, Investing Cash Flow, ROE, ROA, Accounts Receivable, Accounts Payable, PMI, GDP

Data sources:

• Taiwan Market Observation Post System

• Taiwan Stock Exchange

• Taiwan Depository & Clearing Corporation

• Yahoo Stock Feed

Update frequency:

• Yearly - dividends, remuneration of directors and supervisors

• Quarterly - quarterly financial statements

• Monthly - revenue

• Weekly -

• Daily - closing price

• Hourly - stock news from Yahoo stock feed

• Minutely - important news from Taiwan Market Observation Post System

Something like this, but written in PHP

A super long running process (1 hour+) loops from the first stock to the last one

Stock.find_each do |stock|
  # download XML financial report data
  # extract XML data
  # calculate advanced data
end

A super long running process for quarterly report

A super long running process for monthly revenue

A super long running process for daily price

A super long running process for news

…


• Really slow

• Inefficient - unable to only retry the failed one

• Unpredictable server loading

[Diagram: when the server loading is low, jobs 1-5 run spread out over time; when the server loading is HIGH, they overlap with other tasks on the same box]

Too many crawler processes executed at the same time

• Really slow

• Inefficient - unable to only retry the failed one

• Unpredictable server loading

• Scaling out is not easy

• Inherent problems of Unix Cron:

• Unreliable scheduling

• High availability is not easy

• Hard to prioritize jobs by popularity

• Not easy to deal with bandwidth throttling issues

Sidekiq - created by Mike Perham

[Diagram: requests hit the web server process, which pushes jobs to the job queue (very fast) - the producer. Worker processes on worker servers pull jobs from the queue - the consumers. Add extra worker servers when needed.]
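As a minimal sketch of this producer/consumer flow (the worker name and argument are illustrative, not from the talk):

class StockCrawlWorker
  include Sidekiq::Worker

  # consumer side: a worker thread picks this job off the queue
  # and crawls data for one stock
  def perform(stock_id)
    # fetch and store data for stock_id
  end
end

# producer side (e.g., in the web process): push a job onto the queue
StockCrawlWorker.perform_async(2330)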

[Diagram: one worker process contains thread 1, thread 2, thread 3 … thread 25]

Worker process to threads - 1 : 25. A single multi-threaded process runs 25 jobs concurrently, with the same degree of memory consumption as one process.
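The thread count is configurable per process; a minimal sketch (25 was Sidekiq's default concurrency at the time of this talk):

# config/sidekiq.yml
:concurrency: 25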

Sidekiq (OSS) → Sidekiq Pro → Sidekiq Enterprise

Sidekiq Pro adds: Batches, Enhanced Reliability, Search in Web UI, Worker Metrics, Expiring Jobs

Sidekiq Enterprise adds: Rate Limiting, Periodic Jobs, Unique Jobs, Historical Metrics, Multi-process, Encryption

Parallelism Makes Things Faster

With parallel Sidekiq jobs, the old drawbacks become:

• Fast - jobs run in parallel

• Efficient - only retry the failed one

• Predictable server loading

• Easy to scale out

Still open - the inherent problems of Unix Cron:

• Unreliable scheduling

• High availability is not easy

• Hard to prioritize jobs by popularity

• Not easy to deal with bandwidth throttling issues

–Mike Perham, CEO, Contributed Systems, Creator of Sidekiq

Keep the state of cron executions in the most robust part of our system - the database.

All scheduled jobs are invoked by a single job that runs every minute.

Create a table for storing cron settings (table name: cron_jobs):

create_table :cron_jobs do |t|
  t.string    :klass,           null: false  # worker class name
  t.string    :cron_expression, null: false  # e.g. "0 */2 * * *"
  t.timestamp :next_run_at,     null: false, index: true  # when the job should next run
end

klass                        cron_expression   next_run_at
Push2000NewsJobs             "0 */2 * * *"     …
Push2000DailyPriceJobs       "0 2 * * 1-5"     …
Push2000MonthlyRevenueJobs   "0 0 10 * *"      …
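Rows are plain records, so they can be seeded like anything else; a minimal sketch (the CronJob model name mirrors the cron_jobs table above):

CronJob.create!(
  klass:           'Push2000NewsJobs',
  cron_expression: '0 */2 * * *',
  next_run_at:     Time.now
)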

# Add to your cron settings (whenever gem syntax)
every :minute do
  runner 'CronJobWorker.perform_async'
end

Cron only schedules this one job every minute.

CronJobWorker invokes all of your crawlers:

class CronJobWorker
  include Sidekiq::Worker

  def perform
    # Find the jobs that are due
    CronJob.where("next_run_at <= ?", Time.now).find_each do |job|
      # Push each job to the job queue
      Sidekiq::Client.push(
        class: job.klass.constantize,
        args:  ['foo', 'bar']
      )
      # Set up the next execution time
      x = Sidekiq::CronParser.new(job.cron_expression)
      job.update!(next_run_at: x.next.to_time)
    end
  end
end

Missed executions are simply picked up on the next minutely run.

Drawbacks solved:

• Inherent problems of Unix Cron:

• Unreliable scheduling - solved

• Hard to prioritize jobs by popularity

• High availability is not easy

• Not easy to deal with bandwidth throttling issues

Add an args column, then give popular stocks their own, more frequent entries:

table: cron_jobs

klass              cron_expression   args                   next_run_at
Push2000NewsJobs   "0 */2 * * *"     []                     …
NewsWorker         "*/30 * * * *"    [popular_stock_id_1]   …
NewsWorker         "*/30 * * * *"    [popular_stock_id_2]   …
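With the args column in place, CronJobWorker can push each row's stored arguments instead of hard-coded ones; a minimal sketch:

Sidekiq::Client.push(
  class: job.klass.constantize,
  args:  job.args  # per-row arguments, e.g. [popular_stock_id_1]
)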

Drawbacks solved:

• Inherent problems of Unix Cron:

• Unreliable scheduling - solved

• Hard to prioritize jobs by popularity - solved

• High availability is not easy

• Not easy to deal with bandwidth throttling issues

Sidekiq Enterprise's periodic jobs can replace cron entirely, so a single cron host is no longer a single point of failure:

Sidekiq.configure_server do |config|
  config.periodic do |mgr|
    mgr.register("* * * * *", CronJobWorker)  # run CronJobWorker every minute
  end
end

• Inherent problems of Unix Cron:

• Unreliable scheduling - solved

• Hard to prioritize jobs by popularity - solved

• High availability is not easy - solved

• Not easy to deal with bandwidth throttling issues - up next

You always want your crawler to be as fast as possible. However, your target server doesn't always allow you to crawl at an unlimited rate.

If you want to crawl data for your 2000 stocks, you might insert 2000 jobs into the queue at the same time:

Stock.pluck(:id).each do |stock_id|
  SomeWorker.perform_async(stock_id)
end

Assume the target server accepts requests at a maximum rate of 1 request per second.

[Diagram: at second 1, job1 runs while job2, job3 … job2000 all fire at the same time]

All of your jobs may be blocked (except the first one).

Improvement 1: schedule jobs with incremental delays

Stock.pluck(:id).each_with_index do |stock_id, index|
  SomeWorker.perform_in(index, stock_id)  # delay each job by `index` seconds
end

[Diagram: job1 at second 1, job2 at second 2, … job2000 at second 2000 - one request per second]

Workable, but…

[Diagram: the target server becomes unreachable around second 1]

If the target server is unreachable, job3~2000 will still execute at their scheduled times - fixed delays don't adapt to the target server's state.

• Limit your worker threads to perform specific jobs at a bounded rate

• Sidekiq Enterprise provides two types of rate-limiting APIs

CONCURRENT_LIMITER = Sidekiq::Limiter.concurrent('price', 10)

def perform(*args)
  CONCURRENT_LIMITER.within_limit do
    # crawl stock data
  end
end

Only 10 concurrent operations inside the block can happen at any given moment.

BUCKET_LIMITER = Sidekiq::Limiter.bucket('price', 10, :second)

def perform(*args)
  BUCKET_LIMITER.within_limit do
    # crawl stock data
  end
end

For every second, you can perform up to 10 operations.

You must fine-tune your limiter parameters for each data source to get the best performance.
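For example, one limiter per data source; a minimal sketch (the source names and rates here are illustrative assumptions, not the talk's actual numbers):

LIMITERS = {
  'twse'  => Sidekiq::Limiter.bucket('twse_crawl', 5, :second),   # assumed rate
  'yahoo' => Sidekiq::Limiter.bucket('yahoo_crawl', 20, :second)  # assumed rate
}.freeze

def perform(source, stock_id)
  LIMITERS.fetch(source).within_limit do
    # crawl stock data for stock_id from this source
  end
end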

By now you've already got better performance. However, the throttling control of your target server may not always be static - many websites throttle dynamically.

If throttling is detected, pause your workers for a while.
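How you detect throttling is source-specific; a minimal sketch, assuming the target signals throttling with HTTP 429 or 503:

def throttled?(response)
  # assumption: this source returns 429/503 when throttling; adjust per source
  [429, 503].include?(response.code.to_i)
end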

[Diagram: Redis (the job queue) holds several queues - default, critical, low, and a dedicated yahoo queue - and worker threads pull jobs from all of them]

Pause the yahoo queue when throttling is detected. Then schedule a job, executed after a few seconds in another queue, to "unpause" it. The yahoo queue is resumed after that unpause job executes.

class SomeWorker
  include Sidekiq::Worker

  def perform
    # try to crawl something
    # ...
    if throttled
      queue_name = self.class.get_sidekiq_options['queue']
      queue = Sidekiq::Queue.new(queue_name)
      queue.pause!
      ResumeJobQueueWorker.perform_in(30.seconds, queue_name)
    end
  end
end

class ResumeJobQueueWorker
  include Sidekiq::Worker
  sidekiq_options queue: :queue_control, unique: :until_executed

  def perform(queue_name)
    queue = Sidekiq::Queue.new(queue_name)
    queue.unpause! if queue.paused?
  end
end

The queue for ResumeJobQueueWorker MUST NOT be the paused queue itself - otherwise the unpause job would never get a chance to run.

We have a dedicated queue for ResumeJobQueueWorker

Decrease the Sidekiq server poll interval for more precise timing control.
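A minimal sketch (average_scheduled_poll_interval is Sidekiq's documented option; the value here is an illustrative choice):

Sidekiq.configure_server do |config|
  # poll the scheduled set more often than the default, so that
  # ResumeJobQueueWorker.perform_in(30.seconds, ...) fires close to on time
  config.average_scheduled_poll_interval = 2
end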

Queue pausing alleviates throttling issues. Is it possible for us to do even better?

Most throttling controls aim to block requests from the same IP address. We can change our IP address via a proxy service.

[Diagram: without a proxy, every request from the Sidekiq server (a.b.c.d) reaches the target server from the same IP - same IP for each request]

[Diagram: with a proxy service, the Sidekiq server sends each request to the proxy endpoint, which routes it through one of several proxy servers (e.f.g.h, i.j.k.l, m.n.o.p, q.r.s.t) - different IP for each request]
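Routing a request through such an endpoint might look like this (the proxy host, port, and target URL are placeholders, not the talk's actual service):

require 'net/http'
require 'uri'

uri = URI('https://target.example.com/stocks/2330')  # placeholder target URL

# proxy.example.com:8080 stands in for the rotating-proxy endpoint
Net::HTTP.start(uri.host, uri.port,
                'proxy.example.com', 8080,
                use_ssl: true) do |http|
  response = http.get(uri.request_uri)
  # parse response.body here
end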

• Inherent problems of Unix Cron:

• Unreliable scheduling - solved

• Hard to prioritize jobs by popularity - solved

• High availability is not easy - solved

• Not easy to deal with bandwidth throttling issues - solved

• With Sidekiq (Enterprise) and a proper design, the following problems are solved:

• Slow crawler

• Inefficient - unable to only retry the failed one

• Unpredictable server loading

• Scaling out is not easy

• Inherent problems of Unix Cron

• Not easy to deal with bandwidth throttling issues