Project Skyfall Matt Abrams (@abramsm)files.meetup.com/1810133/BigDataDC_Skyfall_Preso_V2.pdf ·...

Post on 10-Jul-2020

Project Skyfall

Matt Abrams (@abramsm)

Agenda

A bit about AddThis

Why did we need Skyfall?

Architecture

Operations/Performance

Introduction

Fun with Numbers

•  AddThis JavaScript loads more than 3 billion times per day
•  Edge Network (Skyfall) receives around 4 billion hits per day
•  Either data center can handle 100% of the load (we test this often)
•  Currently using around 1K servers (will double next year)

Data Center Porn

Why did we need Skyfall?

•  We couldn’t find anyone else to do it for us
    •  Previous vendor’s log aggregation was delayed by a minimum of 3 hours and could take up to 5 days
•  Minimize impact on our publishers
    •  Combining log collection with remote services means we only need 1 event instead of n
•  Support near-real-time applications

Why did we call it Skyfall?

Skyfall Goals and Architecture

Skyfall Goals (Technical)

•  High availability
•  Low latency
•  Use for internal and external logging needs
•  O(1) reads and writes
•  Smart clients
•  Handle server and DC failure gracefully
•  Zero downtime deployment and configuration
•  In-session RPC
•  Support data filtering at the edge

Why speed and robustness matter

Architecture

[Diagram labels: Web Event (three shown), Global Traffic Management, and two data centers (DC1 and DC2), each containing Skyfall nodes, Consumers, and Services.]

Repeater

1.  Messages are placed on a concurrent non-blocking queue (CNBQ) to minimize latency impact on the producer

2.  Messages are then popped from the CNBQ and placed on a disk-backed queue (DBQ)

3.  The DBQ provides temporary storage in case Kafka is down or backed up

4.  Messages are popped from the DBQ and sent to Kafka, where they are persisted to the file system
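The four steps above can be sketched roughly as follows. This is a minimal illustration, not the Skyfall implementation: `Repeater`, `DiskBackedQueue`, and `send_to_kafka` are hypothetical names, the DBQ is kept in memory instead of on disk, and Python's `queue.SimpleQueue` stands in for the CNBQ.

```python
import queue
from collections import deque


class DiskBackedQueue:
    """Hypothetical stand-in for the DBQ; a real one would spill to disk."""

    def __init__(self):
        self._items = deque()

    def push(self, msg):
        self._items.append(msg)

    def peek(self):
        return self._items[0]

    def pop(self):
        return self._items.popleft()

    def __len__(self):
        return len(self._items)


class Repeater:
    def __init__(self, send_to_kafka):
        self.cnbq = queue.SimpleQueue()  # step 1: unbounded, put() never blocks
        self.dbq = DiskBackedQueue()     # steps 2-3: buffer while Kafka is down
        self.send_to_kafka = send_to_kafka  # assumed to raise ConnectionError on failure

    def accept(self, msg):
        """Step 1: called on the producer's request path; returns immediately."""
        self.cnbq.put(msg)

    def drain_to_dbq(self):
        """Step 2: move everything currently in the CNBQ onto the DBQ."""
        moved = 0
        while True:
            try:
                msg = self.cnbq.get_nowait()
            except queue.Empty:
                return moved
            self.dbq.push(msg)
            moved += 1

    def flush_to_kafka(self):
        """Step 4: send DBQ messages to Kafka; stop and keep them if Kafka fails."""
        sent = 0
        while len(self.dbq) > 0:
            try:
                self.send_to_kafka(self.dbq.peek())
            except ConnectionError:
                break  # step 3: the message stays in the DBQ for a later retry
            self.dbq.pop()
            sent += 1
        return sent
```

A message is removed from the DBQ only after the send succeeds, so an outage delays delivery instead of dropping data (in this sketch, at the cost of a possible duplicate if a send succeeds but reports failure).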

Kafka

•  Kafka treats persistence as a first-class citizen
•  Focus is on high throughput rather than lots of bells and whistles
•  State about what has been consumed is maintained in the client rather than the server
•  Kafka is explicitly distributed
•  Supports O(1) reads and writes
•  Pull rather than push

http://incubator.apache.org/kafka/design.html
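To illustrate what "state maintained in the client" and "pull rather than push" mean in practice, here is a toy sketch. It does not use the Kafka client API; `Log` and `Consumer` are hypothetical classes that model one append-only log and consumers that each track their own read offset.

```python
class Log:
    """Toy append-only log standing in for a single topic partition."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_records):
        # The log keeps no per-consumer state; the caller says where to start.
        return self._records[offset:offset + max_records]


class Consumer:
    """Each consumer owns its offset and pulls when it is ready."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch
```

Because the offset lives in the consumer, a slow or restarted consumer does not affect others reading the same log, and the server side stays simple.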

Circuit Breaker for Remote Services

The pattern is used to detect failures and encapsulates the logic of preventing a failure from constantly recurring [1]

•  If a service instance throws an error, times out, or responds with a failure message, an error event is marked
•  If the error rate threshold is exceeded, that service instance is removed from the pool of available services
•  Before re-adding a service to the pool, a test request is made and validated
•  Internal service failures should not be reflected in the response to the message originator

[1] - http://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
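A minimal sketch of the circuit breaker behavior described above, for a single service instance. The class name, the sliding `window`, the `max_error_rate` default, and the `probe` callback are all assumptions made for illustration; the slides do not say how Skyfall measures the error rate.

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, call, probe, window=10, max_error_rate=0.5):
        self.call = call      # performs the real request to the service
        self.probe = probe    # sends a test request; returns True if it validates
        self.results = deque(maxlen=window)  # True marks an error event
        self.max_error_rate = max_error_rate
        self.open = False     # True means "removed from the pool"

    def available(self):
        # Before re-adding the instance, a test request must succeed.
        if self.open and self.probe():
            self.open = False
            self.results.clear()
        return not self.open

    def request(self, *args):
        if not self.available():
            return None  # the failure is hidden from the message originator
        try:
            result = self.call(*args)
        except Exception:
            self.results.append(True)
            window_full = len(self.results) == self.results.maxlen
            if window_full and sum(self.results) / len(self.results) > self.max_error_rate:
                self.open = True
            return None
        self.results.append(False)
        return result
```

Returning `None` on failure is one way to keep internal service errors out of the response path; the caller simply proceeds without the service's contribution.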

What does a call to our endpoint look like?

"GET /live/t00/250lo.gif&foo=bar" 200 37 "http://s7.addthis.com/static/r07/sh103.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"

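Fields, in order of appearance:

•  Topic: live
•  Version: t00
•  Resource: 250lo.gif
•  URL Parameters: foo=bar
•  Status Code: 200
•  Bytes Transferred: 37
•  CDN Resource: http://s7.addthis.com/static/r07/sh103.html
•  User Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)

The endpoint also receives header and cookie information not shown here.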
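As a rough illustration of how the sample line breaks down into the labeled fields, the sketch below parses it with a regular expression. The function name and dictionary keys are hypothetical, and treating the first `?` or `&` in the request target as the start of the parameters is an assumption based on the sample line, not a documented Skyfall rule.

```python
import re
from urllib.parse import parse_qsl

# Matches: "METHOD target" status bytes "cdn resource" "user agent"
LINE = re.compile(
    r'"(?P<method>\S+) (?P<target>[^"\s]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<cdn_resource>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_call(line):
    match = LINE.fullmatch(line.strip())
    if match is None:
        raise ValueError("unrecognized line format")
    # Assumption: parameters start at the first "?" or "&" in the target.
    parts = re.split(r"[?&]", match["target"], maxsplit=1)
    topic, version, resource = parts[0].strip("/").split("/", 2)
    params = dict(parse_qsl(parts[1])) if len(parts) > 1 else {}
    return {
        "topic": topic,
        "version": version,
        "resource": resource,
        "params": params,
        "status": int(match["status"]),
        "bytes": int(match["bytes"]),
        "cdn_resource": match["cdn_resource"],
        "user_agent": match["user_agent"],
    }
```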

Zero Downtime Deployment and Configuration

[Diagram: two server groups, Group 1 and Group 2, each drawn with server labels S1 through S5 and the numbers 4, 8, and 16.]

Endpoint Configuration

•  Each endpoint maps to a ‘topic’
•  Header elements may be extracted from the HTTP request
•  Parameters may be mapped to new key names
•  Variables may be extracted from the URL path
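The four capabilities above could be expressed as a small per-endpoint configuration applied to each request. The sketch below is hypothetical: the configuration keys (`topic`, `path_variables`, `headers`, `param_names`) and the function name are invented for illustration and are not Skyfall's actual configuration format.

```python
def apply_endpoint_config(config, path, params, headers):
    """Build an event record from one HTTP request using an endpoint config."""
    segments = path.strip("/").split("/")
    event = {"topic": config["topic"]}

    # Variables extracted from the URL path, by segment position.
    for name, index in config["path_variables"].items():
        event[name] = segments[index]

    # Header elements extracted from the HTTP request.
    for name in config["headers"]:
        if name in headers:
            event[name] = headers[name]

    # Parameters mapped to new key names; unmapped keys pass through unchanged.
    for key, value in params.items():
        event[config["param_names"].get(key, key)] = value

    return event


# Hypothetical configuration for an endpoint that maps to the "live" topic.
LIVE_ENDPOINT = {
    "topic": "live",
    "path_variables": {"version": 1, "resource": 2},
    "headers": ["User-Agent"],
    "param_names": {"foo": "source"},
}
```

Keeping this mapping in configuration instead of code fits the zero-downtime goal: changing what an endpoint records would not require a new build.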

Data Center Repeater

[Diagram: repeater nodes labeled N1, N2, N3 and N1, N2.]

•  DC Repeater nodes automatically negotiate peering relationships with nodes in the other data center
•  If a peer node becomes unreachable, the local node will select a new peer
•  These nodes are special consumers of the Kafka log data created by the local node
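The failover behavior can be sketched as below. This is a simplification under stated assumptions: the slides describe a negotiation between nodes, whereas this hypothetical `RepeaterNode` simply picks the first reachable remote node, and `is_reachable` is a caller-supplied check.

```python
class RepeaterNode:
    def __init__(self, name, remote_nodes, is_reachable):
        self.name = name
        self.remote_nodes = list(remote_nodes)  # nodes in the other data center
        self.is_reachable = is_reachable        # callable: node name -> bool
        self.peer = None

    def ensure_peer(self):
        """Keep the current peer if it is reachable; otherwise select a new one."""
        if self.peer is not None and self.is_reachable(self.peer):
            return self.peer
        self.peer = next(
            (node for node in self.remote_nodes if self.is_reachable(node)),
            None,  # no remote node is reachable right now
        )
        return self.peer
```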

Skyfall Operations

Requests per second (VA Data Center)

TCP - When do you say goodbye?

http://upload.wikimedia.org/wikipedia/commons/a/a2/Tcp_state_diagram_fixed.svg

Connection Tracking – what you need to know

•  Connection information is maintained in memory
•  The message “ip_conntrack: table full, dropping packet” is BAD
•  Chrome doesn’t close the connection on FIN
    •  This means the connection info remains open until it times out, drastically increasing the number of connections your server needs to track
•  You need some mechanism for timing out the connection in a reasonable time period
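A back-of-the-envelope estimate shows why the timeout matters: in steady state, the number of tracked connections is roughly the rate of new connections multiplied by how long each entry stays in the table. The rate and durations below are hypothetical numbers chosen for illustration, not measurements from Skyfall.

```python
def tracked_connections(new_connections_per_second, seconds_tracked):
    """Steady-state table size: arrival rate times time each entry is kept."""
    return new_connections_per_second * seconds_tracked


# Hypothetical server accepting 2,000 new connections per second.
closed_promptly = tracked_connections(2_000, 5)      # entries removed after 5 s
left_to_time_out = tracked_connections(2_000, 600)   # entries linger for 600 s
```

With these assumed numbers, letting entries linger for 600 seconds instead of 5 grows the table by a factor of 120, which is how a tracking table sized for normal traffic can fill up.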

HA Proxy

•  We use a simple round-robin load balancing algorithm with a liveness check
•  Default connection timeouts are way too high. Reasonable values are used to prevent excessive connection tracking
•  “http-close” and “http-server-close” are enabled to ensure low latency for clients and fast session reuse for the server
•  HA Proxy is our solution of choice for our LB needs. We prefer software solutions on commodity hardware vs expensive custom LB appliances
•  They could use a new logo
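For readers unfamiliar with the idea, round-robin balancing with a liveness check can be sketched in a few lines. This is an illustration of the algorithm only, not how HA Proxy implements it; `RoundRobinBalancer` and `is_alive` are hypothetical names, and a real balancer would run health checks in the background instead of on every pick.

```python
class RoundRobinBalancer:
    def __init__(self, servers, is_alive):
        self.servers = list(servers)
        self.is_alive = is_alive  # callable: server name -> bool
        self._next = 0

    def pick(self):
        """Return the next live server in rotation, or None if all are down."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if self.is_alive(server):
                return server
        return None
```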