Project Skyfall Matt Abrams (@abramsm)files.meetup.com/1810133/BigDataDC_Skyfall_Preso_V2.pdf ·...


Project Skyfall

Matt Abrams (@abramsm)

Agenda

A bit about AddThis

Why did we need Skyfall?

Architecture

Operations/Performance

Introduction

Fun with Numbers

•  AddThis JavaScript loads more than 3 billion times per day
•  Edge Network (Skyfall) receives around 4B hits per day
•  Either data center can handle 100% load (we test this often)
•  Currently using around 1K servers (will double next year)

Data Center Porn

Why did we need Skyfall?

We couldn’t find anyone else to do it for us
•  Previous vendor’s log aggregation was delayed by a minimum of 3 hours and could take up to 5 days

Minimize impact on our publishers
•  Combining log collection with remote services means we only need 1 event instead of n

Support near-real-time applications

Why did we call it Skyfall?

Skyfall Goals and Architecture

Skyfall Goals (Technical)

•  High availability
•  Low latency
•  Use for internal and external logging needs
•  O(1) reads and writes
•  Smart clients
•  Handle server and DC failure gracefully
•  Zero downtime deployment and configuration
•  In-session RPC
•  Support data filtering at the edge

Why speed and robustness matter

Architecture

[Diagram: Web Events reach two data centers, DC1 and DC2, through Global Traffic Management. Each data center contains Skyfall nodes, Consumers, and Services. A Repeater sits between the data centers.]

1.  Messages are placed on concurrent non-blocking queue (CNBQ) to minimize latency impact on producer

2.  Messages are then popped from CNBQ and placed on a Disk-Backed queue (DBQ)

3.  DBQ is used to provide temporary storage in case Kafka is down or backed up

4.  Messages from DBQ are popped and sent to Kafka where they are persisted to file system
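
The four steps above can be sketched roughly as follows. This is only an illustration of the idea, not Skyfall's code: `DiskBackedQueue`, `drain_to_disk`, `forward`, and the `send` callable (a stand-in for a Kafka producer) are all hypothetical names, and Python's `queue.SimpleQueue` stands in for the concurrent non-blocking queue.

```python
import json
import queue


class DiskBackedQueue:
    """Hypothetical DBQ: an append-only file of JSON lines plus a read offset."""

    def __init__(self, path):
        self.path = path
        self.offset = 0  # byte offset of the next unsent message
        open(self.path, "ab").close()  # create the file if it is missing

    def push(self, message):
        with open(self.path, "ab") as f:
            f.write(json.dumps(message).encode("utf-8") + b"\n")

    def peek(self):
        """Return (message, next_offset), or None when nothing is pending."""
        with open(self.path, "rb") as f:
            f.seek(self.offset)
            line = f.readline()
        if not line:
            return None
        return json.loads(line), self.offset + len(line)

    def commit(self, next_offset):
        """Mark everything before next_offset as delivered."""
        self.offset = next_offset


def drain_to_disk(memory_queue, dbq):
    """Step 2: move everything currently in the in-memory queue onto disk."""
    moved = 0
    while True:
        try:
            message = memory_queue.get_nowait()
        except queue.Empty:
            return moved
        dbq.push(message)
        moved += 1


def forward(dbq, send):
    """Steps 3-4: send pending messages; keep them on disk if send fails."""
    sent = 0
    while True:
        pending = dbq.peek()
        if pending is None:
            return sent
        message, next_offset = pending
        try:
            send(message)  # assumed to raise ConnectionError when downstream is down
        except ConnectionError:
            return sent  # nothing is lost; the message stays in the DBQ for a retry
        dbq.commit(next_offset)
        sent += 1
```

The producer only ever touches the in-memory queue (step 1), so a slow disk or a slow downstream never adds latency to the request path. The read offset is advanced only after a successful send, which is what lets the disk file absorb an outage.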

Kafka

•  Kafka treats persistence as a first-class citizen
•  Focus is on high throughput vs. lots of bells and whistles
•  State about what has been consumed is maintained in the client rather than the server
•  Kafka is explicitly distributed
•  Supports O(1) reads and writes
•  Pull rather than push

http://incubator.apache.org/kafka/design.html

Circuit Breaker for Remote Services

The pattern is used to detect failures and encapsulates the logic of preventing a failure from recurring constantly [1]

•  If a service instance throws an error, times out, or responds with a failure message, an error event is marked
•  If the error rate threshold is exceeded, that service instance is removed from the pool of available services
•  Before re-adding a service to the pool, a test request is made and validated
•  Internal service failures should not be reflected in the response to the message originator

[1] - http://en.wikipedia.org/wiki/Circuit_breaker_design_pattern
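
A minimal sketch of the circuit breaker behavior described above, under stated assumptions: the class name `ServicePool`, the sliding window of recent outcomes, and the default threshold values are all illustrative choices, not details from the talk.

```python
from collections import deque


class ServicePool:
    """Hypothetical pool that trips instances whose recent error rate is too high."""

    def __init__(self, instances, error_threshold=0.5, window=4):
        self.error_threshold = error_threshold
        self.window = window
        self.available = list(instances)
        self.tripped = []
        # Most recent outcomes per instance: True = success, False = error.
        self.history = {name: deque(maxlen=window) for name in instances}

    def record(self, name, ok):
        """Mark an outcome and remove the instance if the threshold is exceeded."""
        outcomes = self.history[name]
        outcomes.append(ok)
        window_full = len(outcomes) == self.window
        error_rate = outcomes.count(False) / len(outcomes)
        if window_full and error_rate > self.error_threshold and name in self.available:
            self.available.remove(name)
            self.tripped.append(name)

    def call(self, name, func, fallback=None):
        """Run func; on failure mark an error and hide it behind a fallback value."""
        try:
            result = func()
        except Exception:
            self.record(name, ok=False)
            return fallback  # the originator never sees the internal failure
        self.record(name, ok=True)
        return result

    def retry_tripped(self, probe):
        """Re-add a tripped instance only after a test request succeeds."""
        for name in list(self.tripped):
            if probe(name):
                self.tripped.remove(name)
                self.history[name].clear()
                self.available.append(name)
```

Here `probe` plays the role of the validated test request: a tripped instance rejoins the pool only when `probe(name)` returns true, and its history is cleared so old errors do not trip it again immediately.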

What does a call to our endpoint look like?

"GET /live/t00/250lo.gif&foo=bar" 200 37 "http://s7.addthis.com/static/r07/sh103.html" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"

•  Topic: live
•  Version: t00
•  Resource: 250lo.gif
•  URL Parameters: foo=bar
•  Status Code: 200
•  Bytes Transferred: 37
•  CDN Resource: http://s7.addthis.com/static/r07/sh103.html
•  User Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)

The endpoint also receives header and cookie information not shown here.
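
To make the field breakdown concrete, here is a rough sketch of pulling those fields out of the sample line with a regular expression. The pattern and the `parse_hit` function are assumptions written for this one sample; they are not the parser Skyfall uses, and the field names are simply taken from the slide labels.

```python
import re

# Assumed layout: "GET /<topic>/<version>/<resource>&<params>" <status> <bytes> "<cdn>" "<ua>"
LOG_PATTERN = re.compile(
    r'"GET /(?P<topic>[^/]+)/(?P<version>[^/]+)/(?P<resource>[^&"\s]+)'
    r'(?:&(?P<params>[^"\s]*))?" '
    r'(?P<status>\d{3}) (?P<bytes>\d+) '
    r'"(?P<cdn_resource>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_hit(line):
    """Return a dict of the labeled fields, or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Split "k1=v1&k2=v2" into a dict; the params group is None when absent.
    pairs = (fields.pop("params") or "").split("&")
    fields["params"] = dict(p.split("=", 1) for p in pairs if "=" in p)
    fields["status"] = int(fields["status"])
    fields["bytes"] = int(fields["bytes"])
    return fields


sample = (
    '"GET /live/t00/250lo.gif&foo=bar" 200 37 '
    '"http://s7.addthis.com/static/r07/sh103.html" '
    '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)"'
)
hit = parse_hit(sample)
```

Note that the sample separates the resource from its parameters with `&` rather than `?`, so the pattern follows the sample as written.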

Zero Downtime Deployment and Configuration

[Diagram: two server groups, Group 1 and Group 2, each showing stages S1 through S5, with the numbers 4, 8, and 16 beside S3, S4, and S5.]

Endpoint Configuration

•  Each endpoint maps to a ‘topic’
•  Header elements may be extracted from the HTTP request
•  Parameters may be mapped to new key names
•  Variables may be extracted from the URL path
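
One way to picture such a configuration is sketched below. The config keys (`topic`, `path_template`, `headers`, `rename_params`), the `build_event` function, and the renamed key `example_key` are all hypothetical; the talk does not show the actual configuration format.

```python
# Hypothetical endpoint configuration; none of these key names come from the talk.
ENDPOINT = {
    "topic": "live",
    "path_template": "/{topic}/{version}/{resource}",  # variables taken from the path
    "headers": ["User-Agent", "Referer"],              # header elements to extract
    "rename_params": {"foo": "example_key"},           # parameter -> new key name
}


def build_event(config, path, params, headers):
    """Apply the endpoint configuration to one request and return an event dict."""
    names = [part.strip("{}") for part in config["path_template"].strip("/").split("/")]
    values = path.strip("/").split("/")
    if len(names) != len(values):
        return None  # path does not fit this endpoint's template
    event = dict(zip(names, values))
    if event.get("topic") != config["topic"]:
        return None  # request belongs to a different topic
    for header in config["headers"]:
        if header in headers:
            event[header] = headers[header]
    for key, value in params.items():
        event[config["rename_params"].get(key, key)] = value
    return event
```

For example, a request for `/live/t00/250lo.gif` with parameter `foo=bar` would yield an event carrying `topic`, `version`, and `resource` from the path, the configured headers that are present, and `example_key` in place of `foo`.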

Data Center Repeater

[Diagram: repeater nodes N1, N2, and N3 in one data center and N1 and N2 in the other.]

•  DC Repeater nodes automatically negotiate peering relationships with nodes in the other data center
•  If a peer node becomes unreachable, the local node will select a new peer
•  These are special consumers of the Kafka log data created by the local node
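
The failover rule can be sketched in a few lines. This is only an assumed simplification: the real negotiation protocol is not described in the talk, and `select_peer` and `is_reachable` are hypothetical names.

```python
def select_peer(current_peer, remote_nodes, is_reachable):
    """Keep the current peer while it is reachable; otherwise pick another one.

    remote_nodes: nodes in the other data center, in preference order.
    is_reachable: callable returning True if a node currently responds.
    """
    if current_peer is not None and is_reachable(current_peer):
        return current_peer
    for node in remote_nodes:
        if node != current_peer and is_reachable(node):
            return node
    return None  # no reachable peer right now; the caller would retry later
```

Since a repeater is a consumer of the local Kafka log, a node with no reachable peer can simply pause and resume from its last consumed position once a peer is found.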

Skyfall Operations

Requests per second (VA Data Center)

TCP - When do you say goodbye?

http://upload.wikimedia.org/wikipedia/commons/a/a2/Tcp_state_diagram_fixed.svg

Connection Tracking – what you need to know

•  Connection information is maintained in memory
•  The message “ip_conntrack: table full, dropping packet” is BAD
•  Chrome doesn’t close the connection on FIN
•  This means that the connection info remains open until it times out, drastically increasing the number of connections your server needs to track
•  You need some mechanism for timing out the connection in a reasonable time period
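
A toy model shows why the idle timeout matters. This is not how the kernel's connection tracking is implemented; `ConnectionTable` and its parameters are hypothetical and only illustrate a fixed-size table that fills up when stale entries are not expired.

```python
class ConnectionTable:
    """Toy fixed-size table of tracked connections with an idle timeout."""

    def __init__(self, idle_timeout, max_entries):
        self.idle_timeout = idle_timeout  # seconds a connection may stay idle
        self.max_entries = max_entries
        self.last_seen = {}               # connection id -> last activity time

    def touch(self, conn_id, now):
        """Record activity; return False when the table is full (packet dropped)."""
        if conn_id not in self.last_seen and len(self.last_seen) >= self.max_entries:
            return False
        self.last_seen[conn_id] = now
        return True

    def expire(self, now):
        """Drop entries idle for longer than the timeout; return how many were removed."""
        stale = [c for c, seen in self.last_seen.items()
                 if now - seen > self.idle_timeout]
        for conn_id in stale:
            del self.last_seen[conn_id]
        return len(stale)
```

With a long timeout, connections that were never closed cleanly occupy slots until new connections start being refused; a shorter timeout frees those slots sooner.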

HA Proxy

•  We use a simple round-robin load balancing algorithm with a liveness check
•  Default connection timeouts are way too high. Reasonable values are used to prevent excessive connection tracking
•  “http-close” and “http-server-close” are enabled to ensure low latency for clients and fast session reuse for the server
•  HA Proxy is our solution of choice for our LB needs. We prefer software solutions on commodity hardware vs. expensive custom LB appliances
•  They could use a new logo
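
For readers unfamiliar with the approach, round-robin selection with a liveness check can be sketched as below. This illustrates the algorithm only; it is not HA Proxy's implementation or configuration, and `RoundRobinBalancer` and `is_alive` are hypothetical names.

```python
class RoundRobinBalancer:
    """Hand out servers in rotation, skipping any that fail the liveness check."""

    def __init__(self, servers, is_alive):
        self.servers = list(servers)
        self.is_alive = is_alive  # callable: server -> bool
        self._next = 0            # index of the next server to try

    def pick(self):
        """Return the next live server, or None if every server is down."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if self.is_alive(server):
                return server
        return None
```

In a real load balancer the liveness check runs in the background on a schedule rather than on every pick; calling it inline here just keeps the sketch short.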