Building Efficient and Reliable Crawler System With Sidekiq Enterprise


Has anyone ever written crawlers?

Has anyone ever used cron?

Has anyone ever used Sidekiq?

Gary (Chien-Wei Chu) @icarus4 / @icarus4.chu

Was a C programmer; fell in love with Ruby in 2013

CTO of Statementdog

I play badminton, StarCraft, and Ruby.

• Introduction to Statementdog

• Data behind Statementdog

• Past practice of Statementdog

• Problems of the past practice

• How we design our system to solve the problems

Focus on:

• More reliable job scheduling

• Dealing with throttling issues

The data we track: Revenue, EPS, Gross Margin, Net Income, Assets, Liabilities, Operating Cash Flow, Free Cash Flow, Investing Cash Flow, ROE, ROA, Accounts Receivable, Accounts Payable, PMI, GDP

Data sources:

• Taiwan Market Observation Post System

• Taiwan Stock Exchange

• Taiwan Depository & Clearing Corporation

• Yahoo Stock Feed

Update frequency:

• Yearly - dividends, remuneration of directors and supervisors

• Quarterly - quarterly financial statements

• Monthly - revenue

• Weekly -

• Daily - closing price

• Hourly - stock news from Yahoo stock feed

• Minutely - important news from Taiwan Market Observation Post System

Something like this, but written in PHP

A super long running process (1 hour+) loops from the first stock to the last one

Stock.find_each do |stock|
  # download XML financial report data
  # extract XML data
  # calculate advanced data
end

A super long running process for quarterly report

A super long running process for monthly revenue

A super long running process for daily price

A super long running process for news

…


• Really slow

• Inefficient - unable to only retry the failed one

• Unpredictable server loading

[Diagram: when the server loading is low, jobs 1-5 run spread out over time; when the server loading is HIGH, they overlap with other tasks on the same box]

Too many crawler processes executed at the same time

• Really slow

• Inefficient - unable to only retry the failed one

• Unpredictable server loading

• Scaling out is not easy

• Inherent problems of Unix Cron:

• Unreliable scheduling

• High availability is not easy

• Hard to prioritize jobs by popularity

• Not easy to deal with bandwidth throttling issues

Sidekiq - created by Mike Perham

[Diagram: requests hit the web server process, which pushes jobs to the job queue (very fast) - the producer. Worker processes on worker servers pull jobs from the queue - the consumers. Add extra worker servers when needed.]
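As a minimal sketch of this producer/consumer flow (the worker name and argument are illustrative, not from the talk):

class StockCrawlWorker
  include Sidekiq::Worker

  # consumer side: a worker thread picks this job off the queue
  # and crawls data for one stock
  def perform(stock_id)
    # fetch and store data for stock_id
  end
end

# producer side (e.g., in the web process): push a job onto the queue
StockCrawlWorker.perform_async(2330)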

[Diagram: one worker process contains thread 1, thread 2, thread 3 … thread 25]

Worker process to threads - 1 : 25. A single multi-threaded process runs 25 jobs concurrently, with the same degree of memory consumption as one process.
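The thread count is configurable per process; a minimal sketch (25 was Sidekiq's default concurrency at the time of this talk):

# config/sidekiq.yml
:concurrency: 25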

Sidekiq (OSS) → Sidekiq Pro → Sidekiq Enterprise

Sidekiq Pro adds: Batches, Enhanced Reliability, Search in Web UI, Worker Metrics, Expiring Jobs

Sidekiq Enterprise adds: Rate Limiting, Periodic Jobs, Unique Jobs, Historical Metrics, Multi-process, Encryption

Parallelism Makes Things Faster

With parallel Sidekiq jobs, the old drawbacks become:

• Fast - jobs run in parallel

• Efficient - only retry the failed one

• Predictable server loading

• Easy to scale out

Still open - the inherent problems of Unix Cron:

• Unreliable scheduling

• High availability is not easy

• Hard to prioritize jobs by popularity

• Not easy to deal with bandwidth throttling issues

–Mike Perham, CEO, Contributed Systems, Creator of Sidekiq

Keep the state of cron executions in the most robust part of our system - the database.

All scheduled jobs are invoked by a single job that runs every minute.

Create a table for storing cron settings (table name: cron_jobs):

create_table :cron_jobs do |t|
  t.string    :klass,           null: false  # worker class name
  t.string    :cron_expression, null: false  # e.g. "0 */2 * * *"
  t.timestamp :next_run_at,     null: false, index: true  # when the job should next run
end

klass                        cron_expression   next_run_at
Push2000NewsJobs             "0 */2 * * *"     …
Push2000DailyPriceJobs       "0 2 * * 1-5"     …
Push2000MonthlyRevenueJobs   "0 0 10 * *"      …
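Rows are plain records, so they can be seeded like anything else; a minimal sketch (the CronJob model name mirrors the cron_jobs table above):

CronJob.create!(
  klass:           'Push2000NewsJobs',
  cron_expression: '0 */2 * * *',
  next_run_at:     Time.now
)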

# Add to your cron settings (whenever gem syntax)
every :minute do
  runner 'CronJobWorker.perform_async'
end

Cron only schedules this one job every minute.

CronJobWorker invokes all of your crawlers:

class CronJobWorker
  include Sidekiq::Worker

  def perform
    # Find the jobs that are due
    CronJob.where("next_run_at <= ?", Time.now).find_each do |job|
      # Push each job to the job queue
      Sidekiq::Client.push(
        class: job.klass.constantize,
        args:  ['foo', 'bar']
      )
      # Set up the next execution time
      x = Sidekiq::CronParser.new(job.cron_expression)
      job.update!(next_run_at: x.next.to_time)
    end
  end
end

Missed executions are simply picked up on the next minutely run.

Drawbacks solved:

• Inherent problems of Unix Cron:

• Unreliable scheduling - solved

• Hard to prioritize jobs by popularity

• High availability is not easy

• Not easy to deal with bandwidth throttling issues

Add an args column, then give popular stocks their own, more frequent entries:

table: cron_jobs

klass              cron_expression   args                   next_run_at
Push2000NewsJobs   "0 */2 * * *"     []                     …
NewsWorker         "*/30 * * * *"    [popular_stock_id_1]   …
NewsWorker         "*/30 * * * *"    [popular_stock_id_2]   …
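With the args column in place, CronJobWorker can push each row's stored arguments instead of hard-coded ones; a minimal sketch:

Sidekiq::Client.push(
  class: job.klass.constantize,
  args:  job.args  # per-row arguments, e.g. [popular_stock_id_1]
)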

Drawbacks solved:

• Inherent problems of Unix Cron:

• Unreliable scheduling - solved

• Hard to prioritize jobs by popularity - solved

• High availability is not easy

• Not easy to deal with bandwidth throttling issues

Sidekiq Enterprise's periodic jobs can replace cron entirely, so a single cron host is no longer a single point of failure:

Sidekiq.configure_server do |config|
  config.periodic do |mgr|
    mgr.register("* * * * *", CronJobWorker)  # run CronJobWorker every minute
  end
end

• Inherent problems of Unix Cron:

• Unreliable scheduling - solved

• Hard to prioritize jobs by popularity - solved

• High availability is not easy - solved

• Not easy to deal with bandwidth throttling issues - up next

You always want your crawler to be as fast as possible. However, your target server doesn't always allow you to crawl at an unlimited rate.

If you want to crawl data for your 2000 stocks, you might insert 2000 jobs into the queue at the same time:

Stock.pluck(:id).each do |stock_id|
  SomeWorker.perform_async(stock_id)
end

Assume the target server accepts requests at a maximum rate of 1 request per second.

[Diagram: at second 1, job1 runs while job2, job3 … job2000 all fire at the same time]

All of your jobs may be blocked (except the first one).

Improvement 1: schedule jobs with incremental delays

Stock.pluck(:id).each_with_index do |stock_id, index|
  SomeWorker.perform_in(index, stock_id)  # delay each job by `index` seconds
end

[Diagram: job1 at second 1, job2 at second 2, … job2000 at second 2000 - one request per second]

Workable, but…

[Diagram: the target server becomes unreachable around second 1]

If the target server is unreachable, job3~2000 will still execute at their scheduled times - fixed delays don't adapt to the target server's state.

• Limit your worker threads to perform specific jobs at a bounded rate

• Sidekiq Enterprise provides two types of rate-limiting APIs

CONCURRENT_LIMITER = Sidekiq::Limiter.concurrent('price', 10)

def perform(*args)
  CONCURRENT_LIMITER.within_limit do
    # crawl stock data
  end
end

Only 10 concurrent operations inside the block can happen at any given moment.

BUCKET_LIMITER = Sidekiq::Limiter.bucket('price', 10, :second)

def perform(*args)
  BUCKET_LIMITER.within_limit do
    # crawl stock data
  end
end

For every second, you can perform up to 10 operations.

You must fine-tune your limiter parameters for each data source to get the best performance.
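For example, one limiter per data source; a minimal sketch (the source names and rates here are illustrative assumptions, not the talk's actual numbers):

LIMITERS = {
  'twse'  => Sidekiq::Limiter.bucket('twse_crawl', 5, :second),   # assumed rate
  'yahoo' => Sidekiq::Limiter.bucket('yahoo_crawl', 20, :second)  # assumed rate
}.freeze

def perform(source, stock_id)
  LIMITERS.fetch(source).within_limit do
    # crawl stock data for stock_id from this source
  end
end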

By now you've already got better performance. However, the throttling control of your target server may not always be static - many websites throttle dynamically.

If throttling is detected, pause your workers for a while.
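How you detect throttling is source-specific; a minimal sketch, assuming the target signals throttling with HTTP 429 or 503:

def throttled?(response)
  # assumption: this source returns 429/503 when throttling; adjust per source
  [429, 503].include?(response.code.to_i)
end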

[Diagram: Redis (the job queue) holds several queues - default, critical, low, and a dedicated yahoo queue - and worker threads pull jobs from all of them]

Pause the yahoo queue when throttling is detected. Then schedule a job, executed after a few seconds in another queue, to "unpause" it. The yahoo queue is resumed after that unpause job executes.

class SomeWorker
  include Sidekiq::Worker

  def perform
    # try to crawl something
    # ...
    if throttled
      queue_name = self.class.get_sidekiq_options['queue']
      queue = Sidekiq::Queue.new(queue_name)
      queue.pause!
      ResumeJobQueueWorker.perform_in(30.seconds, queue_name)
    end
  end
end

class ResumeJobQueueWorker
  include Sidekiq::Worker
  sidekiq_options queue: :queue_control, unique: :until_executed

  def perform(queue_name)
    queue = Sidekiq::Queue.new(queue_name)
    queue.unpause! if queue.paused?
  end
end

The queue for ResumeJobQueueWorker MUST NOT be the paused queue itself - otherwise the unpause job would never get a chance to run.

We have a dedicated queue for ResumeJobQueueWorker

Decrease the Sidekiq server poll interval for more precise timing control.
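A minimal sketch (average_scheduled_poll_interval is Sidekiq's documented option; the value here is an illustrative choice):

Sidekiq.configure_server do |config|
  # poll the scheduled set more often than the default, so that
  # ResumeJobQueueWorker.perform_in(30.seconds, ...) fires close to on time
  config.average_scheduled_poll_interval = 2
end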

Queue pausing alleviates throttling issues. Is it possible for us to do even better?

Most throttling controls aim to block requests from the same IP address. We can change our IP address via a proxy service.

[Diagram: without a proxy, every request from the Sidekiq server (a.b.c.d) reaches the target server from the same IP - same IP for each request]

[Diagram: with a proxy service, the Sidekiq server sends each request to the proxy endpoint, which routes it through one of several proxy servers (e.f.g.h, i.j.k.l, m.n.o.p, q.r.s.t) - different IP for each request]
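Routing a request through such an endpoint might look like this (the proxy host, port, and target URL are placeholders, not the talk's actual service):

require 'net/http'
require 'uri'

uri = URI('https://target.example.com/stocks/2330')  # placeholder target URL

# proxy.example.com:8080 stands in for the rotating-proxy endpoint
Net::HTTP.start(uri.host, uri.port,
                'proxy.example.com', 8080,
                use_ssl: true) do |http|
  response = http.get(uri.request_uri)
  # parse response.body here
end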

• Inherent problems of Unix Cron:

• Unreliable scheduling - solved

• Hard to prioritize jobs by popularity - solved

• High availability is not easy - solved

• Not easy to deal with bandwidth throttling issues - solved

• With Sidekiq (Enterprise) and a proper design, the following problems are solved:

• Slow crawler

• Inefficient - unable to only retry the failed one

• Unpredictable server loading

• Scaling out is not easy

• Inherent problems of Unix Cron

• Not easy to deal with bandwidth throttling issues