Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer...

83

Transcript of Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer...

Page 1: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software
Page 2: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Hi, I’m

Andrew Godwin

• Django core developer

• Senior Software Engineer at

• Private + Instrument pilot

Page 3: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Content Warning

Page 4: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Software is difficult.

Page 5: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

By Derek Lowe"Things I won't work with"

Page 6: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

On Hexanitrohexaazaisowurtzitane

"...a more stable form of it, by mixing it with TNT.

Yes, this is an example of something that becomes less explosive as a one-to-one cocrystal with TNT."

Page 7: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

On “Sand Won’t Save You This Time”

"...the operator is confronted with the problem of copingwith a metal-fluorine fire.

For dealing with this situation, I have alwaysrecommended a good pair of running shoes."

Page 8: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Unicode

Locales

Time

Calendars

Geography

Money

Page 9: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Network latency

Hardware unreliability

Deadlocks

Bit flips

Ambiguous specifications

No documentation

Page 10: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

We just move faster and hit them at higher speed.Not unique to software

Page 11: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software
Page 12: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Who's solved this? Aviation.

Page 13: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

A Boeing 747 has six million parts

Page 14: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

…and a 0.000006% accident rateA Boeing 747 has six million parts

Page 15: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Airplane

Car

Walking

Train

220

130

30.8

Deaths per billion hours (Per passenger, UK 1990-2000)

30

Page 16: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

People matter as much as machines

Page 17: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Pilot 76%

Aviation Accident Causes (2005 Nall report)

9% Other

16% Mechanical

Page 18: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

And how we can apply them to software.Let's look at some aviation principles

Page 19: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Principle #1Hard Failure

Page 20: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

If something is wrong it turns itself offAutopilots, engines, air conditioning, and more

Page 21: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

This only works if you have redundancyAll of these systems have a backup that lets you land.

Page 22: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

"We'll ignore errors so the site doesn't crash!"

"Save the invalid data and we'll fix it later"

Page 23: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

These are great ways to ensure younever fix something.

Page 24: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

No accident or outage has a single cause.Stop your code getting into odd states.

Page 25: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Fail hard if anything unexpected happens

Validate all your data strictly in and out

Deploy changes early and often

Page 26: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Single points of failure can be goodOnly one place to look when things go wrong!

Page 27: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software
Page 28: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Principle #2Good Alerting

Page 29: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Cockpits are incredibly selective about what sets off an audio alarm

Page 30: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Alert fatigue is real. Avoid at all costs.

Page 31: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Never, ever, put all errors in the same place

Page 32: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Critical

Normal

Background

Page 33: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Critical

Normal

Background

Wakes someone up. Actionable.

Page 34: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Critical

Normal

Background

Wakes someone up. Actionable.

Fixed over the next week.

Page 35: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Critical

Normal

Background

Wakes someone up. Actionable.

Fixed over the next week.

Metrics, not errors.

Page 36: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Have you been ignoring an error for weeks?Then turn off its error reporting.

Page 37: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Principle #3Find your limits

Page 38: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Everything will fail. You should know when.

Page 39: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Copyright Boeing

Page 40: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

What's your Minimum Equipment List?What can you run the system without?

Page 41: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Lavatory ashtrays

Air conditioning

Seatbelt signs

Passenger video screens

Fuel caps

Weather radar

REQUIRED OPTIONAL

Page 42: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Did you load test? Did you fuzz test?

Page 43: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

You don't have to perfectly scale.But you do have to know where your limits are.

Page 44: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Risk is fine when you're informed!Unknowns are the most dangerous thing.

Page 45: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Principle #4Build for failure

Page 46: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

No single thing in an aircraft canfail and take it down.

Page 47: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

We all want this for our code, butthe way to do it is to build for failure.

Page 48: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Kill your application randomly

Practice server network failures

Develop on unreliable connections

Page 49: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

The majority of pilot training ishandling emergencies.

Page 50: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software
Page 51: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Use checklists. Don't rely on memory.

Page 52: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

If you practice failure, you'll be readywhen the inevitable happens.

Page 53: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Pilot 76%

Aviation Accident Causes (2005 Nall report)

9% Other

16% Mechanical

Page 54: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Principle #5Communicate well

Page 55: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

"You have control""I have control"

"You have control"

Page 56: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Complex software meansseparate teams.

Page 57: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

As you grow, communication becomesexponentially harder.

Page 58: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software
Page 59: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software
Page 60: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software
Page 61: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Clear communication is vital.

Page 62: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Write everything down.Written specs = less time in meetings.

Page 63: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Have a clear chain of command.

Page 64: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Make decisions.They don't have to be perfect, just good enough.

Page 65: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Principle #6No blame culture

Page 66: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

How do I know all these aviation stats?

Page 67: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Every incident is reported and investigated.

Page 68: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

There is never a single cause of a problem.

Page 69: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Make it very difficult to do again.Why did your software let this happen? What's the UX of your admin tools like?

Page 70: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software
Page 71: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software
Page 72: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Encourage reporting.Don't blame anyone for a mistake. They're unlikely to make it again.

Page 73: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Reward maintenance as well as firefightingIt's easy to look good when you ship broken and are always heroically fixing it.

Page 74: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software
Page 75: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

In aviation, every rule is written in blood.

Page 76: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Software is not yet there.But we are getting closer.

Page 77: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Margaret HamiltonHer error detection code saved Apollo 11

Page 78: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Patriot MissileFloating-point bug killed 28

Page 79: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Therac-25Killed 3, severely injured

at least 3 more

Page 80: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Uber AutonomousVehicleSaw a pedestrian and chose

to hit her

Page 81: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software
Page 82: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Hard failureGood alerting

Find your limitsBuild for failure

Communicate wellNo blame culture

Page 83: Hi, I’m - Amazon S3 › pyconil-data-amit › ...Hi,I’m Andrew Godwin • Django core developer • Senior Software Engineer at • Private + Instrument pilot Content Warning Software

Thanks.Andrew Godwin

@andrewgodwin aeracode.org