Lessons Learned from Our First Worldwide Outage...2 © 2017 Imperva, Inc. All rights reserved. CDN...
Transcript of Lessons Learned from Our First Worldwide Outage...2 © 2017 Imperva, Inc. All rights reserved. CDN...
-
© 2017 Imperva, Inc. All rights reserved.
Lessons Learned from Our First Worldwide OutageYoav Cohen, VP Engineering, Incapsula@yoavcohen
-
© 2017 Imperva, Inc. All rights reserved.2
OriginCDN
Caching Rules
L3/4 Scrubbing
DDoS Rules
WAF
WAF Rules
What is Incapsula?
LoadBalancer
Application Delivery Rules
BotManager
ClientClassification
-
© 2017 Imperva, Inc. All rights reserved.3
OriginCDN L3/4 Scrubbing WAF
What is Incapsula?
LoadBalancer
BotManager
Caching
DDoS Rules
WAF Rules
Client Classification
App Delivery Rules
-
© 2017 Imperva, Inc. All rights reserved.
Incapsula in High Level
• ~35 PoPs– Proxies (HTTP/S)– Behemoths (L3-4)
• Management Console
Confidential4
ManagementConsole
-
© 2017 Imperva, Inc. All rights reserved.
Customer Configuration
• Domain name, IP of origin server
• All servers need to know that
• With many updates happening at any given time
• 60 seconds SLA
Confidential5
-
PRX/BH SERVERPRX/BH SERVER
REPO SERVER
PRX/BH SERVER
Revlite
• Leader repository
• A client service that synchronizes changes to the file-system
• Applications read from the file-system
Confidential6
LeaderRepo
revclient
Local Repo
Application
ManagementConsole
-
© 2017 Imperva, Inc. All rights reserved.
Our First Global Outage
Confidential7
ManagementConsole
-
© 2017 Imperva, Inc. All rights reserved.
IncapRules
• Custom WAF logic
• Tons of flexibility
• High risk– Reading rules– Evaluating rules
Confidential8
-
© 2017 Imperva, Inc. All rights reserved.
IncapRules Safety Mechanism
Confidential9
curr_sites = []
read_rules(site_id=7)eval_rules(site_id=7)
curr_sites.txt:[7]
crashstart-up
curr_sites = [7]
-
© 2017 Imperva, Inc. All rights reserved.
Initial Response
Confidential10
ManagementConsole
-
© 2017 Imperva, Inc. All rights reserved.
Post Mortem Conclusions
1. The safety mechanism did not work
2. The safety mechanism only works after a crash…
Confidential11
-
© 2017 Imperva, Inc. All rights reserved.
1. The safety mechanism did not work
• // Someone commented it out
• Why wasn’t it picked-up by testing?
• Because we didn’t have a test for it
Confidential12
-
© 2017 Imperva, Inc. All rights reserved.
Incapsula Reliability Framework
Confidential13
System Design
Planning
P0 – Customer Traffic
P1 – System Administration
P2 – Customer Administration
-
© 2017 Imperva, Inc. All rights reserved.
2. The safety mechanism only works after a crash
• Simple (bad) solution – propagate slowly– How slow?– Will still impact production systems– Will degrade customer agility
• Break down the problem:– Crash when loading rules– Crash when evaluating rules à Safety Mechanism
Confidential14
-
© 2017 Imperva, Inc. All rights reserved.
Crash when loading rules
• Don’t publish configurations that will cause a crash• But, crash happens when publishing
• Publish to a system that is not processing traffic
Confidential15
-
© 2017 Imperva, Inc. All rights reserved.
PROXYPROXY
Revlite Sandbox
Confidential16
REPO SERVER
SANDBOX PROXY
SandboxRepo
Local Repo
Application
ManagementConsole
LeaderRepo
PROXY
Local Repo
Application
-
© 2017 Imperva, Inc. All rights reserved.
Flying two mistakes high
Confidential17
ManagementConsole
SandboxRepo
LeaderRepo
Local Repo
Application
-
© 2017 Imperva, Inc. All rights reserved.
Revert to a last known good
Confidential18
Local Repo
T-24h
ManagementConsole
SandboxRepo
LeaderRepo
Local Repo
Application
Local Repo
T-60m
Local Repo
T-5m
T
-
© 2017 Imperva, Inc. All rights reserved.
Wrapping it up
• Configuration changes can cause issues– While loading the change– After the fact, out of the blue
• Issues during a configuration change– Sandbox the change– Quarantine it if necessary
• Out of the blue– Detect ”bad” objects and quarantine them– Roll-back to a last known good configuration
Confidential19
-
Questions?
@yoavcohen