Lessons Learned from Our First Worldwide Outage...2 © 2017 Imperva, Inc. All rights reserved. CDN...

20
© 2017 Imperva, Inc. All rights reserved. Lessons Learned from Our First Worldwide Outage Yoav Cohen, VP Engineering, Incapsula @yoavcohen

Transcript of Lessons Learned from Our First Worldwide Outage...2 © 2017 Imperva, Inc. All rights reserved. CDN...

  • © 2017 Imperva, Inc. All rights reserved.

    Lessons Learned from Our First Worldwide OutageYoav Cohen, VP Engineering, Incapsula@yoavcohen

  • © 2017 Imperva, Inc. All rights reserved.2

    OriginCDN

    Caching Rules

    L3/4 Scrubbing

    DDoS Rules

    WAF

    WAF Rules

    What is Incapsula?

    LoadBalancer

    Application Delivery Rules

    BotManager

    ClientClassification

  • © 2017 Imperva, Inc. All rights reserved.3

    OriginCDN L3/4 Scrubbing WAF

    What is Incapsula?

    LoadBalancer

    BotManager

    Caching

    DDoS Rules

    WAF Rules

    Client Classification

    App Delivery Rules

  • © 2017 Imperva, Inc. All rights reserved.

    Incapsula in High Level

    • ~35 PoPs– Proxies (HTTP/S)– Behemoths (L3-4)

    • Management Console

    Confidential4

    ManagementConsole

  • © 2017 Imperva, Inc. All rights reserved.

    Customer Configuration

    • Domain name, IP of origin server

    • All servers need to know that

    • With many updates happening at any given time

    • 60 seconds SLA

    Confidential5

  • PRX/BH SERVERPRX/BH SERVER

    REPO SERVER

    PRX/BH SERVER

    Revlite

    • Leader repository

    • A client service that synchronizes changes to the file-system

    • Applications read from the file-system

    Confidential6

    LeaderRepo

    revclient

    Local Repo

    Application

    ManagementConsole

  • © 2017 Imperva, Inc. All rights reserved.

    Our First Global Outage

    Confidential7

    ManagementConsole

  • © 2017 Imperva, Inc. All rights reserved.

    IncapRules

    • Custom WAF logic

    • Tons of flexibility

    • High risk– Reading rules– Evaluating rules

    Confidential8

  • © 2017 Imperva, Inc. All rights reserved.

    IncapRules Safety Mechanism

    Confidential9

    curr_sites = []

    read_rules(site_id=7)eval_rules(site_id=7)

    curr_sites.txt:[7]

    crashstart-up

    curr_sites = [7]

  • © 2017 Imperva, Inc. All rights reserved.

    Initial Response

    Confidential10

    ManagementConsole

  • © 2017 Imperva, Inc. All rights reserved.

    Post Mortem Conclusions

    1. The safety mechanism did not work

    2. The safety mechanism only works after a crash…

    Confidential11

  • © 2017 Imperva, Inc. All rights reserved.

    1. The safety mechanism did not work

    • // Someone commented it out

    • Why wasn’t it picked-up by testing?

    • Because we didn’t have a test for it

    Confidential12

  • © 2017 Imperva, Inc. All rights reserved.

    Incapsula Reliability Framework

    Confidential13

    System Design

    Planning

    P0 – Customer Traffic

    P1 – System Administration

    P2 – Customer Administration

  • © 2017 Imperva, Inc. All rights reserved.

    2. The safety mechanism only works after a crash

    • Simple (bad) solution – propagate slowly– How slow?– Will still impact production systems– Will degrade customer agility

    • Break down the problem:– Crash when loading rules– Crash when evaluating rules à Safety Mechanism

    Confidential14

  • © 2017 Imperva, Inc. All rights reserved.

    Crash when loading rules

    • Don’t publish configurations that will cause a crash• But, crash happens when publishing

    • Publish to a system that is not processing traffic

    Confidential15

  • © 2017 Imperva, Inc. All rights reserved.

    PROXYPROXY

    Revlite Sandbox

    Confidential16

    REPO SERVER

    SANDBOX PROXY

    SandboxRepo

    Local Repo

    Application

    ManagementConsole

    LeaderRepo

    PROXY

    Local Repo

    Application

  • © 2017 Imperva, Inc. All rights reserved.

    Flying two mistakes high

    Confidential17

    ManagementConsole

    SandboxRepo

    LeaderRepo

    Local Repo

    Application

  • © 2017 Imperva, Inc. All rights reserved.

    Revert to a last known good

    Confidential18

    Local Repo

    T-24h

    ManagementConsole

    SandboxRepo

    LeaderRepo

    Local Repo

    Application

    Local Repo

    T-60m

    Local Repo

    T-5m

    T

  • © 2017 Imperva, Inc. All rights reserved.

    Wrapping it up

    • Configuration changes can cause issues– While loading the change– After the fact, out of the blue

    • Issues during a configuration change– Sandbox the change– Quarantine it if necessary

    • Out of the blue– Detect ”bad” objects and quarantine them– Roll-back to a last known good configuration

    Confidential19

  • Questions?

    @yoavcohen