Challenges and Solutions in
Autonomous Vehicle Validation
© 2017 Edge Case Research LLC
Prof. Philip Koopman
Self-driving cars are so cool! But also kind of scary.
Need many skills:
– Good technical skills
– Good process skills
– Safety & security
– Management who “gets it”
AND:
– A safety approach that is more than testing
– Approaches to validate Machine Learning
How Do You Validate Autonomous Vehicles?
https://en.wikipedia.org/wiki/Autonomous_car
NREC: 30+ Years Of Cool Robots
Carnegie Mellon University faculty, staff, and students; off-campus Robotics Institute facility
[Timeline figure, 1985–2010: DARPA SC-ALV, ARPA Demo II, NASA Dante II, NASA Lunar Rover, Auto Excavator, Auto Harvesting, Auto Forklift, Auto Haulage, Auto Spraying, Laser Paint Removal, DARPA PerceptOR, DARPA LAGR, DARPA UPI, DARPA Grand Challenge, Urban Challenge, Mars Rovers, Army FCS]
Software Safety
Even the “easy” (well known) cases are challenging
Lesson Learned: Validating Robots Is Difficult
[Photos: extreme contrast; no markings; poor visibility; unusual obstacles; construction; water (appears flat!)] [Wagner 2014]
The Tough Cases Are Legion
http://piximus.net/fun/funny-and-odd-things-spotted-on-the-road
http://edtech2.boisestate.edu/robertsona/506/images/buffalo.jpg
https://www.flickr.com/photos/hawaii-mcgraths/4458907270/in/photolist-Y59LC-7TNzAv-5WSEds-7N24My
https://pixabay.com/en/bunde-germany-landscape-sheep-92931/
6/2016: Tesla “AutoPilot” crash
– Claimed as the first crash after 130M miles of operation
– But that does not assure 130M miles between crashes!
Are Self-Driving Cars Safe Enough?
Billions of Miles Of Testing?
Need to test for at least ~3x the mean crash rate to validate safety
Hypothetical deployment: NYC medallion taxi fleet
– 13,437 vehicles @ 70,000 miles/yr = 941M miles/year
– 7 critical crashes (death or serious injury) in 2015
– 134M miles per critical crash
How much testing to validate the critical crash rate? Answer: 3x to ~10x the mean crash rate
– 3x is with no crash observed
– If you get a crash, you need to test longer
– Design changes reset the testing clock
Assumes random independent arrivals (exponential inter-arrival spacing) – is this a good assumption?

Testing Miles | Confidence if NO critical crash seen
122.8M | 60%
308.5M | 90%
401.4M | 95%
617.1M | 99%

[2014 NYC Taxi Fact Book] [Fatal and Critical Injury data / Local Law 31 of 2014]
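The table above follows from a standard zero-failure demonstration test: with exponentially distributed crash inter-arrivals, surviving N crash-free miles demonstrates the target rate with confidence 1 − e^(−N/mean). A minimal sketch, using the slide's 134M-mile NYC estimate as the mean:

```python
import math

MEAN_MILES_PER_CRASH = 134e6  # NYC medallion-fleet estimate from the slide

def miles_for_confidence(confidence, mean=MEAN_MILES_PER_CRASH):
    """Fleet miles with ZERO critical crashes needed to demonstrate, at the
    given confidence, that the true rate is no worse than `mean`
    (zero-failure exponential/Poisson demonstration test)."""
    return mean * math.log(1.0 / (1.0 - confidence))

for c in (0.60, 0.90, 0.95, 0.99):
    print(f"{c:.0%}: {miles_for_confidence(c) / 1e6:.1f}M miles")
```

Note that even 60% confidence already costs roughly one mean inter-crash interval of crash-free driving, and 99% costs about 4.6x the mean.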
Mean time between road hazards is too loose a metric
How many miles to see all hazards that arrive, on average, every 1M miles?
– Case 1: 10 unique hazards at 10 million miles/hazard → ~100 million miles of testing
– Case 2: 100,000 unique hazards at 100 billion miles/hazard → ~1 trillion miles of testing
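One way to size these two cases: with independent exponential arrivals, the expected mileage to observe every one of n equally likely hazard types at least once is (miles per hazard) × H_n, the coupon-collector result. A minimal sketch; the slide's figures are order-of-magnitude estimates, and what matters is the qualitative gap between the cases (tens of millions vs. over a trillion miles):

```python
def expected_miles_to_see_all(n_hazards, miles_per_hazard):
    """Expected miles until every one of n equally likely hazard types has
    been observed at least once (coupon-collector bound): the per-hazard
    inter-arrival mileage times the n-th harmonic number H_n."""
    h_n = sum(1.0 / k for k in range(1, n_hazards + 1))
    return miles_per_hazard * h_n

print(f"Case 1: ~{expected_miles_to_see_all(10, 10e6) / 1e6:.0f}M miles")
print(f"Case 2: ~{expected_miles_to_see_all(100_000, 100e9) / 1e12:.1f}T miles")
```

Both cases have the same overall hazard rate (one hazard per million miles), yet splitting the probability mass across more, rarer hazard types multiplies the required testing by four orders of magnitude.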
Heavy Tail Inter-Arrival Rates
[Plots: Random Independent Arrival Rate (exponential) vs. Power Law Arrival Rate (80/20 rule); cross-over at 4000 hrs; the power law curve has many infrequent scenarios; total area under each curve is the same]
Is Trillions of Miles of Testing Enough?
At best, each road hazard type arrives independently
– But that doesn’t tell us how often each event arrives
– No surprise if the distribution of scenarios is heavy-tailed
– E.g., exponential arrivals, but a power law distribution of scenarios
Beyond that, conditions are not random/independent
– Correlations: geographic region, weather, holiday, commute
– Betting against a heavy-tail edge case distribution is risky
Need to think more deeply than “drive a lot of miles”
– Need some assurance your system will work beyond accumulating miles
– That’s what software safety approaches are for!
– OK, so what happens if you try to apply ISO 26262? …
https://goo.gl/VmfM22
Haleakalā National Park
https://goo.gl/PUACcJ
Deep Learning Can Be Brittle & Inscrutable
https://goo.gl/5sKnZV
QuocNet: Car → Not a Car (magnified difference)
AlexNet: Bus → Not a Bus (magnified difference)
Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013).
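The construction behind these examples can be shown without a deep network at all. Below is a hedged toy sketch: a linear "classifier" over 1024 synthetic pixels (a stand-in, not a real image model) is flipped from "car" to "not a car" by a ~2% per-pixel perturbation aimed against its score gradient, the same sign-of-gradient idea used in the paper cited above:

```python
import random

random.seed(0)
N = 1024
w = [random.gauss(0.0, 1.0) for _ in range(N)]   # "learned" weights (toy model)
x = [random.gauss(0.0, 1.0) for _ in range(N)]   # a synthetic "image"

dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))

# Adjust x so the classifier score is exactly +1 (a confident "car").
shift = (dot(w, x) - 1.0) / dot(w, w)
x = [xi - shift * wi for xi, wi in zip(x, w)]

def is_car(image):
    return dot(w, image) > 0.0

# Adversarial perturbation: a tiny step against the score gradient.
# Each pixel moves by only 0.02 (imperceptible at this pixel scale),
# but all 1024 moves conspire in the same direction in score space.
eps = 0.02
sign = lambda v: 1.0 if v > 0 else -1.0
x_adv = [xi - eps * sign(wi) for xi, wi in zip(x, w)]

print(is_car(x), is_car(x_adv))   # the perturbed image is no longer a "car"
```

Real deep networks are nonlinear, but locally they behave much like this: many tiny coordinated input changes add up to a large score change, which is why "almost identical" images can classify differently.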
Machine Learning doesn’t fit the V
– ML “learns” how to behave via training data
– The ML software itself is just a framework
– The training data forms de facto requirements
– The “magic” of ML is being able to skip writing requirements
– But how do we trace that to acceptance tests?
– We’ll need to argue the training data or test data is “complete”
Is the training data “complete” and correct?
– Training data is safety critical
– What if a moderately rare case isn’t trained?
– It might not behave as you expect
– People’s perception of “almost the same” does not necessarily predict ML responses!
How Do We Trace To Safety Standard “V”?
[V-model diagram: left leg – Requirements Specification → System Specification → Subsystem/Component Specification → Program Specification → Module Specification → Source Code; right leg – Unit Test → Program Test → Subsystem/Component Test → System Integration & Test → Acceptance Test; verification & traceability across each rung, validation & traceability at the top, and reviews at every step. For ML, training-data cluster analysis takes the place of written specifications, leaving open questions (“?”) about traceability.]
Validating everything at once is infeasible
– Might need 1 billion to 1 trillion miles of road testing
– Try breaking this up into two steps: requirements and implementation
Validate requirements: what is the correct behavior?
– Taxonomy of driving scenarios, tracing data collection to test scenarios
– Taxonomy of edge cases, tracing data collection to test scenarios
– Challenge: an edge case for an ML system might be different than an edge case for a human driver
Validate implementation: does the vehicle behave correctly?
– Test scenarios are only a start – what if the system is brittle?
– Does it correctly identify which scenario it is experiencing?
– Robustness testing can help identify problems
It’s not all ML; use traditional safety for the rest (ISO 26262)
– Might use non-ML software to ensure ML safety
Validation: Requirements vs. Implementation
DISTRIBUTION A – NREC case number STAA-2013-10-02
https://goo.gl/g3t3P9
Safety Envelope:
– Specify unsafe regions for safety
– Specify safe regions for functionality
– Deal with a complex boundary by:
– Under-approximating the safe region
– Over-approximating the unsafe region
– Trigger the system safety response upon transition to the unsafe region
Partition the requirements:
– Operation: functional requirements
– Failsafe: safety requirements (safety functions)
Safety Envelope Approach to ML Deployment
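An under-approximated safe region can be sketched as a conservative stopping-distance check. All constants below are hypothetical illustration values, not numbers from any real vehicle: worst-case braking, worst-case reaction latency, and an extra margin that over-approximates the unsafe region.

```python
# Hypothetical worst-case parameters (illustration only)
BRAKE_DECEL = 3.0   # m/s^2: weakest braking assumed available
REACT_TIME  = 0.5   # s: worst-case system reaction latency
MARGIN      = 2.0   # m: buffer that over-approximates the unsafe region

def in_safe_region(speed_mps, obstacle_dist_m):
    """Under-approximated safe region: return False (trigger the safety
    response) whenever the vehicle could not provably stop short of the
    obstacle under the worst-case assumptions above."""
    stopping = speed_mps * REACT_TIME + speed_mps ** 2 / (2 * BRAKE_DECEL)
    return obstacle_dist_m > stopping + MARGIN

print(in_safe_region(10.0, 30.0))   # True: 10 m/s can stop within 30 m
print(in_safe_region(20.0, 30.0))   # False: same gap is unsafe at 20 m/s
```

Because every assumption is pessimistic, a True answer is trustworthy even though some states flagged False would in fact have been survivable; that conservatism is exactly the point of under-approximating the safe region.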
“Doer” subsystem
– Implements normal functionality
– Allocate functional requirements to the Doer
– Put all the Machine Learning content here!
“Checker” subsystem – traditional software
– Implements failsafes (safety functions)
– Allocate safety requirements to the Checker
– Use traditional software safety techniques
Checker is entirely responsible for safety
– Doer can be at a low Safety Integrity Level (SIL): its failure is lack of availability, not a mishap
– Checker must be at a high SIL (its failure is unsafe), e.g., per ISO 26262
– Often, the Checker can be much simpler than the Doer
Architecting A Safety Envelope System
[Diagram: Doer/Checker pair – ML Doer at low SIL; simple safety-envelope Checker at high SIL]
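The Doer/Checker pairing can be sketched in a few lines. Everything here is an illustrative stand-in (the speed limit, the fake ML planner, the failsafe action are all invented for the example); the point is the structure: the checker alone decides whether the doer's output reaches the actuators.

```python
# Minimal Doer/Checker sketch (names and values are illustrative assumptions)
SPEED_LIMIT = 22.0   # m/s: the checker's hard envelope bound

def ml_doer(sensor_speed):
    """Stand-in for an ML planner; may be arbitrarily wrong (low SIL).
    Here it misbehaves by always requesting 5 m/s more than sensed."""
    return sensor_speed + 5.0

def checker(commanded_speed):
    """High-SIL checker: small and auditable, holds ALL safety requirements.
    Vetoes any command outside the envelope and engages a failsafe."""
    if commanded_speed > SPEED_LIMIT:
        return ("FAILSAFE", 0.0)      # e.g., command a safe stop
    return ("OK", commanded_speed)

print(checker(ml_doer(10.0)))   # within envelope -> passed through
print(checker(ml_doer(20.0)))   # violates envelope -> failsafe engaged
```

Note that the checker never needs to understand why the doer chose its command; it only needs a provably correct envelope test, which is what makes a high-SIL argument tractable.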
APD (Autonomous Platform Demonstrator): How did we make this scenario safe?
Target GVW: 8,500 kg; target speed: 80 km/hr
Approved for Public Release. TACOM Case #20247 Date: 07 OCT 2009
The Autonomous Platform Demonstrator (APD) was the first UGV to use a Safety Monitor as part of its safety case.
As a result, the U.S. Army approved APD for demonstrations involving soldier participation.
U.S. Army cites high quality of APD safety case and turns to NREC to improve the safety of unmanned vehicles.
Approved for Public Release – Distribution is Unlimited (NREC case #: STAA-2012-10-17)
[Diagrams: Fail-Safe Approach vs. Fail-Operational Approach]
ASTAA: Automated Stress Testing of Autonomy Architectures
– Key idea: combinations of exceptional & normal inputs to an interface
– Builds on 20 years of research experience
– Ballista OS robustness testing in the 1990s … to …
– Autonomous vehicle testing starting in 2010
Example: ground vehicle CAN interception
– Test injector
– Selectively modifies CAN messages on the fly
– Modifications based on data type information
– Invariant monitor
– Reads messages for invariant evaluation
– “Checker” invariant monitor detects failures
Commercial tool build-out: Edge Case Research Switchboard (software & hardware interface testing)
Robustness Testing
DISTRIBUTION A – NREC case number STAA-2013-10-02
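The injector-plus-invariant-monitor loop can be sketched abstractly. This is a hedged stand-in, not the actual ASTAA or Switchboard API: a naive speed controller is fed a mix of exceptional and normal values while a "checker"-style invariant monitor watches its outputs.

```python
import math

# Exceptional values an injector might substitute into a message field,
# plus some ordinary values for contrast.
EXCEPTIONAL = [float("nan"), float("inf"), -float("inf"), 0.0, -1.0, 1e300]
NORMAL = [0.5, 10.0, 25.0]

def speed_controller(target_mps):
    """System under test: naive code that mishandles exceptional inputs.
    min() silently passes NaN through, and negatives are never rejected."""
    return min(target_mps, 30.0)

def invariant_ok(output):
    """Invariant monitor: commanded speed must be finite and in [0, 30]."""
    return not math.isnan(output) and 0.0 <= output <= 30.0

failures = [v for v in EXCEPTIONAL + NORMAL
            if not invariant_ok(speed_controller(v))]
print(failures)   # inputs whose outputs violated the invariant
```

Even this toy run surfaces the pattern the next slide catalogs: the controller looks fine on normal inputs, and only the exceptional values expose that it propagates NaN and negative speeds downstream.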
Improper handling of floating-point numbers: Inf, NaN, limited precision
Array indexing and allocation: images, point clouds, etc.
– Segmentation faults due to arrays that are too small
– Many forms of buffer overflow with complex data types
– Large arrays and memory exhaustion
Time: time flowing backwards, jumps
– Not rejecting stale data
Problems handling dynamic state: for example, lists of perceived objects or command trajectories
– Race conditions permit improper insertion or removal of items
– Garbage collection causes crashes or hangs
Robustness Testing Finds Problems
DISTRIBUTION A – NREC case number STAA-2013-10-02
Good job of advocating for progress
– E.g., recommends the ISO 26262 safety standard
But some areas need work, such as:
– Machine Learning validation is immature
– Need independent safety assessment
– Re-certification for all updates to critical components
– Which long-tail exceptional situations are “reasonable” to mitigate?
– Crash data recorders must have forensic validity
Open response to NHTSA at: https://goo.gl/ZumJss
https://betterembsw.blogspot.com/2016/10/draft-response-to-dot-policy-on-highly.html
Comments On NHTSA Policy
Multi-prong approach required for validation
– Need rigorous engineering, not just vehicle testing
– Account for the heavy-tail distribution of scenarios
– Too many to just learn them all?
– How does the system recognize it doesn’t know a scenario?
Unique Machine Learning validation challenges
– How do we create “requirements” for an ML system?
– How do we ensure that testing traces to the ML training data?
– How do we ensure adequate requirements and testing coverage for the real world?
Two techniques that we’re putting into practice
– Safety monitor: let ML optimize behavior while guarding against the unexpected
– Robustness testing: inject faults into system building blocks to uncover defects
Conclusions
[General Motors]
Specialists in Making Robust Autonomy Software
– Embedded software quality assessment and improvement
– Embedded software skill evaluation and training
– Lightweight software processes for mission-critical systems
– Lightweight automation for process & technical area improvement
[email protected] http://www.edge-case-research.com/