Challenges and Solutions in
Autonomous Vehicle Validation
© 2017 Edge Case Research LLC
Prof. Philip Koopman
Self-driving cars are so cool! But also kind of scary.
Need many skills:
– Good technical skills
– Good process skills
– Safety & security
– Management who “gets it”
AND:
– A safety approach that is more than testing
– Approaches to validate Machine Learning
How Do You Validate Autonomous Vehicles?
https://en.wikipedia.org/wiki/Autonomous_car
NREC: 30+ Years Of Cool Robots
Carnegie Mellon University faculty, staff, and students; off-campus Robotics Institute facility
[Timeline figure, 1985–2010: DARPA SC-ALV, ARPA Demo II, NASA Dante II, NASA Lunar Rover, Auto Excavator, Auto Harvesting, Auto Forklift, Auto Haulage, Auto Spraying, Laser Paint Removal, DARPA PerceptOR, DARPA LAGR, DARPA UPI, DARPA Grand Challenge, Urban Challenge, Mars Rovers, Army FCS]
Software Safety
Even the “easy” (well known) cases are challenging
Lesson Learned: Validating Robots Is Difficult
[Photos: extreme contrast; no markings; poor visibility; unusual obstacles; construction; water (appears flat!)] [Wagner 2014]
The Tough Cases Are Legion
http://piximus.net/fun/funny-and-odd-things-spotted-on-the-road
http://edtech2.boisestate.edu/robertsona/506/images/buffalo.jpg
https://www.flickr.com/photos/hawaii-mcgraths/4458907270/in/photolist-Y59LC-7TNzAv-5WSEds-7N24My
https://pixabay.com/en/bunde-germany-landscape-sheep-92931/
6/2016: Tesla “AutoPilot” crash
– Claimed as the first crash after 130M miles of operation
– But that does not assure 130M miles between crashes!
Are Self-Driving Cars Safe Enough?
Billions of Miles Of Testing?
Need to test for at least ~3x the mean crash rate to validate safety
Hypothetical deployment: NYC medallion taxi fleet
– 13,437 vehicles @ 70,000 miles/yr = 941M miles/year
– 7 critical crashes (death or serious injury) in 2015
– 134M miles per critical crash
How much testing to validate the critical crash rate? Answer: 3x to ~10x the mean crash rate
– 3x is with no crash observed
– If you get a crash, you need to test longer
– Design changes reset the testing clock
Assumes random independent arrivals (exponential inter-arrival spacing) – is this a good assumption?

Testing Miles | Confidence if NO critical crash seen
122.8M | 60%
308.5M | 90%
401.4M | 95%
617.1M | 99%

[2014 NYC Taxi Fact Book] [Fatal and Critical Injury data / Local Law 31 of 2014]
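The table above follows from a standard zero-failure demonstration test: with exponentially distributed crash inter-arrivals, surviving N crash-free miles demonstrates the target rate with confidence 1 − e^(−N/mean). A minimal sketch, using the slide's 134M-mile NYC estimate as the mean:

```python
import math

MEAN_MILES_PER_CRASH = 134e6  # NYC medallion-fleet estimate from the slide

def miles_for_confidence(confidence, mean=MEAN_MILES_PER_CRASH):
    """Fleet miles with ZERO critical crashes needed to demonstrate, at the
    given confidence, that the true rate is no worse than `mean`
    (zero-failure exponential/Poisson demonstration test)."""
    return mean * math.log(1.0 / (1.0 - confidence))

for c in (0.60, 0.90, 0.95, 0.99):
    print(f"{c:.0%}: {miles_for_confidence(c) / 1e6:.1f}M miles")
```

Note that even 60% confidence already costs roughly one mean inter-crash interval of crash-free driving, and 99% costs about 4.6x the mean.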
Mean time between road hazards is too loose a metric
How many miles to see all hazards that arrive, on average, every 1M miles?
– Case 1: 10 unique hazards at 10 million miles/hazard → ~100 million miles of testing
– Case 2: 100,000 unique hazards at 100 billion miles/hazard → ~1 trillion miles of testing
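One way to size these two cases: with independent exponential arrivals, the expected mileage to observe every one of n equally likely hazard types at least once is (miles per hazard) × H_n, the coupon-collector result. A minimal sketch; the slide's figures are order-of-magnitude estimates, and what matters is the qualitative gap between the cases (tens of millions vs. over a trillion miles):

```python
def expected_miles_to_see_all(n_hazards, miles_per_hazard):
    """Expected miles until every one of n equally likely hazard types has
    been observed at least once (coupon-collector bound): the per-hazard
    inter-arrival mileage times the n-th harmonic number H_n."""
    h_n = sum(1.0 / k for k in range(1, n_hazards + 1))
    return miles_per_hazard * h_n

print(f"Case 1: ~{expected_miles_to_see_all(10, 10e6) / 1e6:.0f}M miles")
print(f"Case 2: ~{expected_miles_to_see_all(100_000, 100e9) / 1e12:.1f}T miles")
```

Both cases have the same overall hazard rate (one hazard per million miles), yet splitting the probability mass across more, rarer hazard types multiplies the required testing by four orders of magnitude.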
Heavy Tail Inter-Arrival Rates
[Plots: Random Independent Arrival Rate (exponential) vs. Power Law Arrival Rate (80/20 rule); cross-over at 4000 hrs; the power law curve has many infrequent scenarios; total area under each curve is the same]
Is Trillions of Miles of Testing Enough?
At best, each road hazard type arrives independently
– But that doesn’t tell us how often each event arrives
– No surprise if the distribution of scenarios is heavy-tailed
– E.g., exponential arrivals, but a power law distribution of scenarios
Beyond that, conditions are not random/independent
– Correlations: geographic region, weather, holiday, commute
– Betting against a heavy-tail edge case distribution is risky
Need to think more deeply than “drive a lot of miles”
– Need some assurance your system will work beyond accumulating miles
– That’s what software safety approaches are for!
– OK, so what happens if you try to apply ISO 26262? …
https://goo.gl/VmfM22
Haleakalā National Park
https://goo.gl/PUACcJ
Deep Learning Can Be Brittle & Inscrutable
https://goo.gl/5sKnZV
QuocNet: Car → Not a Car (magnified difference)
AlexNet: Bus → Not a Bus (magnified difference)
Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).
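The construction behind these examples can be shown without a deep network at all. Below is a hedged toy sketch: a linear "classifier" over 1024 synthetic pixels (a stand-in, not a real image model) is flipped from "car" to "not a car" by a ~2% per-pixel perturbation aimed against its score gradient, the same sign-of-gradient idea used in the paper cited above:

```python
import random

random.seed(0)
N = 1024
w = [random.gauss(0.0, 1.0) for _ in range(N)]   # "learned" weights (toy model)
x = [random.gauss(0.0, 1.0) for _ in range(N)]   # a synthetic "image"

dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))

# Adjust x so the classifier score is exactly +1 (a confident "car").
shift = (dot(w, x) - 1.0) / dot(w, w)
x = [xi - shift * wi for xi, wi in zip(x, w)]

def is_car(image):
    return dot(w, image) > 0.0

# Adversarial perturbation: a tiny step against the score gradient.
# Each pixel moves by only 0.02 (imperceptible at this pixel scale),
# but all 1024 moves conspire in the same direction in score space.
eps = 0.02
sign = lambda v: 1.0 if v > 0 else -1.0
x_adv = [xi - eps * sign(wi) for xi, wi in zip(x, w)]

print(is_car(x), is_car(x_adv))   # the perturbed image is no longer a "car"
```

Real deep networks are nonlinear, but locally they behave much like this: many tiny coordinated input changes add up to a large score change, which is why "almost identical" images can classify differently.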
Machine Learning doesn’t fit the V
– ML “learns” how to behave via training data
– The ML software itself is just a framework
– The training data forms de facto requirements
– The “magic” of ML is being able to skip writing requirements
– But how do we trace that to acceptance tests?
– We’ll need to argue the training data or test data is “complete”
Is the training data “complete” and correct?
– Training data is safety critical
– What if a moderately rare case isn’t trained?
– It might not behave as you expect
– People’s perception of “almost the same” does not necessarily predict ML responses!
How Do We Trace To Safety Standard “V”?
[V-model diagram: left leg – Requirements Specification → System Specification → Subsystem/Component Specification → Program Specification → Module Specification → Source Code; right leg – Unit Test → Program Test → Subsystem/Component Test → System Integration & Test → Acceptance Test; verification & traceability across each rung, validation & traceability at the top, and reviews at every step. For ML, training-data cluster analysis takes the place of written specifications, leaving open questions (“?”) about traceability.]
Validating everything at once is infeasible
– Might need 1 billion to 1 trillion miles of road testing
– Try breaking this up into two steps: requirements and implementation
Validate requirements: what is the correct behavior?
– Taxonomy of driving scenarios, tracing data collection to test scenarios
– Taxonomy of edge cases, tracing data collection to test scenarios
– Challenge: an edge case for an ML system might be different than an edge case for a human driver
Validate implementation: does the vehicle behave correctly?
– Test scenarios are only a start – what if the system is brittle?
– Does it correctly identify which scenario it is experiencing?
– Robustness testing can help identify problems
It’s not all ML; use traditional safety for the rest (ISO 26262)
– Might use non-ML software to ensure ML safety
Validation: Requirements vs. Implementation
DISTRIBUTION A – NREC case number STAA-2013-10-02
https://goo.gl/g3t3P9
Safety Envelope:
– Specify unsafe regions for safety
– Specify safe regions for functionality
– Deal with a complex boundary by:
– Under-approximating the safe region
– Over-approximating the unsafe region
– Trigger the system safety response upon transition to the unsafe region
Partition the requirements:
– Operation: functional requirements
– Failsafe: safety requirements (safety functions)
Safety Envelope Approach to ML Deployment
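An under-approximated safe region can be sketched as a conservative stopping-distance check. All constants below are hypothetical illustration values, not numbers from any real vehicle: worst-case braking, worst-case reaction latency, and an extra margin that over-approximates the unsafe region.

```python
# Hypothetical worst-case parameters (illustration only)
BRAKE_DECEL = 3.0   # m/s^2: weakest braking assumed available
REACT_TIME  = 0.5   # s: worst-case system reaction latency
MARGIN      = 2.0   # m: buffer that over-approximates the unsafe region

def in_safe_region(speed_mps, obstacle_dist_m):
    """Under-approximated safe region: return False (trigger the safety
    response) whenever the vehicle could not provably stop short of the
    obstacle under the worst-case assumptions above."""
    stopping = speed_mps * REACT_TIME + speed_mps ** 2 / (2 * BRAKE_DECEL)
    return obstacle_dist_m > stopping + MARGIN

print(in_safe_region(10.0, 30.0))   # True: 10 m/s can stop within 30 m
print(in_safe_region(20.0, 30.0))   # False: same gap is unsafe at 20 m/s
```

Because every assumption is pessimistic, a True answer is trustworthy even though some states flagged False would in fact have been survivable; that conservatism is exactly the point of under-approximating the safe region.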
“Doer” subsystem
– Implements normal functionality
– Allocate functional requirements to the Doer
– Put all the Machine Learning content here!
“Checker” subsystem – traditional software
– Implements failsafes (safety functions)
– Allocate safety requirements to the Checker
– Use traditional software safety techniques
Checker is entirely responsible for safety
– Doer can be at a low Safety Integrity Level (SIL): its failure is lack of availability, not a mishap
– Checker must be at a high SIL (its failure is unsafe), e.g., per ISO 26262
– Often, the Checker can be much simpler than the Doer
Architecting A Safety Envelope System
[Diagram: Doer/Checker pair – ML Doer at low SIL; simple safety-envelope Checker at high SIL]
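The Doer/Checker pairing can be sketched in a few lines. Everything here is an illustrative stand-in (the speed limit, the fake ML planner, the failsafe action are all invented for the example); the point is the structure: the checker alone decides whether the doer's output reaches the actuators.

```python
# Minimal Doer/Checker sketch (names and values are illustrative assumptions)
SPEED_LIMIT = 22.0   # m/s: the checker's hard envelope bound

def ml_doer(sensor_speed):
    """Stand-in for an ML planner; may be arbitrarily wrong (low SIL).
    Here it misbehaves by always requesting 5 m/s more than sensed."""
    return sensor_speed + 5.0

def checker(commanded_speed):
    """High-SIL checker: small and auditable, holds ALL safety requirements.
    Vetoes any command outside the envelope and engages a failsafe."""
    if commanded_speed > SPEED_LIMIT:
        return ("FAILSAFE", 0.0)      # e.g., command a safe stop
    return ("OK", commanded_speed)

print(checker(ml_doer(10.0)))   # within envelope -> passed through
print(checker(ml_doer(20.0)))   # violates envelope -> failsafe engaged
```

Note that the checker never needs to understand why the doer chose its command; it only needs a provably correct envelope test, which is what makes a high-SIL argument tractable.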
APD (Autonomous Platform Demonstrator): How did we make this scenario safe?
Target GVW: 8,500 kg; target speed: 80 km/hr
Approved for Public Release. TACOM Case #20247 Date: 07 OCT 2009
The Autonomous Platform Demonstrator (APD) was the first UGV to use a Safety Monitor as part of its safety case.
As a result, the U.S. Army approved APD for demonstrations involving soldier participation.
U.S. Army cites high quality of APD safety case and turns to NREC to improve the safety of unmanned vehicles.
Approved for Public Release – Distribution is Unlimited (NREC case #: STAA-2012-10-17)
[Diagrams: Fail-Safe Approach vs. Fail-Operational Approach]
ASTAA: Automated Stress Testing of Autonomy Architectures
– Key idea: combinations of exceptional & normal inputs to an interface
– Builds on 20 years of research experience
– Ballista OS robustness testing in the 1990s … to …
– Autonomous vehicle testing starting in 2010
Example: ground vehicle CAN interception
– Test injector
– Selectively modifies CAN messages on the fly
– Modifications based on data type information
– Invariant monitor
– Reads messages for invariant evaluation
– “Checker” invariant monitor detects failures
Commercial tool build-out: Edge Case Research Switchboard (software & hardware interface testing)
Robustness Testing
DISTRIBUTION A – NREC case number STAA-2013-10-02
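The injector-plus-invariant-monitor loop can be sketched abstractly. This is a hedged stand-in, not the actual ASTAA or Switchboard API: a naive speed controller is fed a mix of exceptional and normal values while a "checker"-style invariant monitor watches its outputs.

```python
import math

# Exceptional values an injector might substitute into a message field,
# plus some ordinary values for contrast.
EXCEPTIONAL = [float("nan"), float("inf"), -float("inf"), 0.0, -1.0, 1e300]
NORMAL = [0.5, 10.0, 25.0]

def speed_controller(target_mps):
    """System under test: naive code that mishandles exceptional inputs.
    min() silently passes NaN through, and negatives are never rejected."""
    return min(target_mps, 30.0)

def invariant_ok(output):
    """Invariant monitor: commanded speed must be finite and in [0, 30]."""
    return not math.isnan(output) and 0.0 <= output <= 30.0

failures = [v for v in EXCEPTIONAL + NORMAL
            if not invariant_ok(speed_controller(v))]
print(failures)   # inputs whose outputs violated the invariant
```

Even this toy run surfaces the pattern the next slide catalogs: the controller looks fine on normal inputs, and only the exceptional values expose that it propagates NaN and negative speeds downstream.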
Improper handling of floating-point numbers: Inf, NaN, limited precision
Array indexing and allocation: images, point clouds, etc.
– Segmentation faults due to arrays that are too small
– Many forms of buffer overflow with complex data types
– Large arrays and memory exhaustion
Time: time flowing backwards, jumps
– Not rejecting stale data
Problems handling dynamic state: for example, lists of perceived objects or command trajectories
– Race conditions permit improper insertion or removal of items
– Garbage collection causes crashes or hangs
Robustness Testing Finds Problems
DISTRIBUTION A – NREC case number STAA-2013-10-02
Good job of advocating for progress
– E.g., recommends the ISO 26262 safety standard
But some areas need work, such as:
– Machine Learning validation is immature
– Need independent safety assessment
– Re-certification for all updates to critical components
– Which long-tail exceptional situations are “reasonable” to mitigate?
– Crash data recorders must have forensic validity
Open response to NHTSA at: https://goo.gl/ZumJss
https://betterembsw.blogspot.com/2016/10/draft-response-to-dot-policy-on-highly.html
Comments On NHTSA Policy
Multi-prong approach required for validation
– Need rigorous engineering, not just vehicle testing
– Account for the heavy-tail distribution of scenarios
– Too many to just learn them all?
– How does the system recognize it doesn’t know a scenario?
Unique Machine Learning validation challenges
– How do we create “requirements” for an ML system?
– How do we ensure that testing traces to the ML training data?
– How do we ensure adequate requirements and testing coverage for the real world?
Two techniques that we’re putting into practice
– Safety monitor: let ML optimize behavior while guarding against the unexpected
– Robustness testing: inject faults into system building blocks to uncover defects
Conclusions
[General Motors]
Specialists in Making Robust Autonomy Software
– Embedded software quality assessment and improvement
– Embedded software skill evaluation and training
– Lightweight software processes for mission-critical systems
– Lightweight automation for process & technical area improvement
[email protected] http://www.edge-case-research.com/