
Safely Entering the Deep: A Review of Verification and Validation for Machine Learning and a Challenge Elicitation in the Automotive Industry

Markus Borg 1, Cristofer Englund 1, Krzysztof Wnuk 2, Boris Duran 1, Christoffer Levandowski 3, Shenjian Gao 2, Yanwen Tan 2, Henrik Kaijser 4, Henrik Lönn 4, Jonas Törnqvist 3

1 RISE Research Institutes of Sweden AB, Scheelevägen 17,

SE-223 70 Lund, Sweden. E-mail: {markus.borg, cristofer.englund, boris.duran}@ri.se

2 Blekinge Institute of Technology, Valhallavägen 1,

SE-371 41 Karlskrona, Sweden. E-mail: [email protected], {shga13, yata13}@student.bth.se

3 QRTECH AB, Flöjelbergsgatan 1C,

SE-431 35 Mölndal, Sweden. E-mail: {christoffer.levandowski, jonas.tornqvist}@qrtech.se

4 AB Volvo, Volvo Group Trucks Technology, SE-405 08 Gothenburg, Sweden

E-mail: {henrik.kaijser, henrik.lonn}@volvo.se

Abstract

Deep Neural Networks (DNN) will emerge as a cornerstone in automotive software engineering. However, developing systems with DNNs introduces novel challenges for safety assessments. This paper reviews the state-of-the-art in verification and validation of safety-critical systems that rely on machine learning. Furthermore, we report from a workshop series on DNNs for perception with automotive experts in Sweden, confirming that ISO 26262 largely contravenes the nature of DNNs. We recommend aerospace-to-automotive knowledge transfer and systems-based safety approaches, e.g., safety cage architectures and simulated system test cases.

Keywords: deep learning, safety-critical systems, machine learning, verification and validation, ISO 26262

1. Introduction

As an enabling technology for autonomous driving, Deep Neural Networks (DNN) will emerge as a cornerstone in automotive software engineering. Automotive software solutions using DNNs are a hot topic, with new advances being reported almost weekly. Also in the academic context, several research communities study DNNs in the automotive domain from various perspectives, e.g., applied Machine Learning (ML)75, software engineering28, safety engineering78, and Verification & Validation (V&V)50.

DNNs are used to enable vehicle environmental perception, i.e., awareness of elements in the surrounding traffic. Successful perception is a prerequisite for autonomous features such as lane departure detection, path/trajectory planning, vehicle tracking, behavior analysis, and scene understanding111 – and a prerequisite to reach levels 3-5 as defined by SAE International's levels of driving automation. A wide range of sensors has been used to collect input data from the environment, but the most common approach is to rely on front-facing cameras34. In recent years, DNNs have demonstrated their usefulness in classifying such camera data, which in turn has enabled both perception and subsequent breakthroughs toward autonomous driving54.

From an ISO 26262 safety assurance perspective, however, developing systems based on DNNs constitutes a major paradigm shift compared to conventional systems∗28. Andrej Karpathy, Director of AI at Tesla, boldly refers to the new era as "Software 2.0"†. No longer do human engineers explicitly describe all system behavior in source code; instead, DNNs are trained using enormous amounts of historical data.

DNNs have been reported to deliver superhuman classification accuracy for specific tasks35, but inevitably they will occasionally fail to generalize93. Unfortunately, from a safety perspective, analyzing when this might happen is currently not possible due to the black-box nature of DNNs. A state-of-the-art DNN might be composed of hundreds of millions of parameter weights, thus the methods for V&V of DNN components must differ from approaches for human-readable source code. Techniques enforced by ISO 26262, such as source code reviews and exhaustive coverage testing, are not applicable78.

The contribution of this review paper is twofold. First, we describe the state-of-the-art in V&V of safety-critical systems that rely on ML. We survey academic literature, partly through a reproducible snowballing review103, i.e., establishing a body of literature by tracing referencing and referenced papers. Second, we elicit the most pressing challenges when engineering safety-critical DNN components in the automotive domain. We report from workshops with automotive experts, and we validate findings from the literature review through an industrial survey. The research has been conducted as part of SMILE‡, a joint research project between RISE AB, Volvo AB, Volvo Cars, QRTech AB, and Semcon AB.

The rest of the paper is organized as follows: Section 2 presents safety engineering concepts within the automotive domain and introduces the fundamentals of DNNs. Section 3 describes the proposed research method, including four sources of empirical evidence, and Section 4 reports our findings. Section 5 presents a synthesis targeting our two objectives, and discusses implications for research and practice. Finally, Section 6 concludes the paper and outlines the most promising directions for future work. Throughout the paper, we use the notation [PX] to explicitly indicate publications that are part of the snowballing literature study.

2. Background

This section first presents development of safety-critical software according to the ISO 26262 standard44. Second, we introduce fundamentals of DNNs, required to understand how they could enable vehicular perception. In the remainder of this paper, we adhere to the following three definitions related to safety-critical systems:

• Safety is "freedom from unacceptable risk of physical injury or of damage to the health of people"43

• Robustness is "the degree to which a component can function correctly in the presence of invalid inputs or stressful environmental conditions"42

• Reliability is "the probability that a component performs its required functions for a desired period of time without failure in specified environments with a desired confidence"11

∗ By conventional systems, we mean any system that does not have the ability to learn or improve from experience.
† https://medium.com/@karpathy/software-2-0-a64152b37c35
‡ The SMILE project: Safety analysis and verification/validation of MachIne LEarning based systems

2.1. Safety Engineering in the Automotive Domain: ISO 26262

Safety is not a property that can be added at the end of the design. Instead, it must be an integral part of the entire engineering process. To successfully engineer a safe system, a systematic safety analysis and a methodological approach to managing risks are required8. Safety analysis comprises identification of hazards, development of approaches to eliminate hazards or mitigate their consequences, and verification that the approaches are in place in the system. Risk assessment is used to determine how safe a system is, and to analyze alternatives to lower the risks in the system.

Safety has always been an important concern in engineering, and best practices have often been collected in governmental or industry safety standards. Common standards provide a shared vocabulary as well as a basis for both internal and external safety assessment, i.e., work tasks both for engineers working in the development organization and for independent safety assessors from certification bodies. For software-intensive systems, the generic meta-standard IEC 6150843 introduces the fundamentals of functional safety for Electrical/Electronic/Programmable Electronic (E/E/PE) Safety-related Systems, i.e., hazards caused by malfunctioning E/E/PE systems rather than non-functional considerations such as fire, radiation, and corrosion. Several domains have their own adaptations of IEC 61508.

ISO 2626244 is the automotive derivative of IEC 61508. Organized into 10 parts, it constitutes a comprehensive safety standard covering all aspects of automotive development, production, and maintenance of safety-related systems. V&V are core activities in safety-critical development and are thus discussed in detail in ISO 26262, especially in Part 4: Product development at the system level and Part 6: Product development at the software level. The scope of the current ISO 26262 standard is series production passenger cars with a max gross weight of 3,500 kg. However, the second edition of the standard, expected in the beginning of 2019, will broaden the scope to also cover trucks, buses, and motorcycles.

The automotive safety lifecycle (ASL) is one key component of ISO 2626266, defining fundamental concepts such as safety manager, safety plan, and confirmation measures including safety review and audit. The ASL describes six phases: management, development, production, operation, service, and decommissioning. Assuming that a safety-critical DNN will be considered a software unit, especially the development phase on the software level (Part 6) mandates practices that will require special treatment. Examples include verification of software implementation using inspections (Part 6:8.4.5) and conventional structural code coverage metrics (Part 6:9.4.5). It is evident that certain ISO 26262 process requirements cannot apply to ML-based software units, in line with how model-based development is currently partially excluded.

Another key component of ISO 26262 is the automotive safety integrity level (ASIL). In the beginning of the ASL development phase, a safety analysis of all critical functions of the system is conducted, with a focus on hazards. Then a risk analysis combining 1) the probability of exposure, 2) the driver's possible controllability, and 3) the possible severity of the outcome results in an ASIL between A and D. ISO 26262 enforces development and verification practices corresponding to the ASIL, with the most rigorous practices required for ASIL D. Functions that are not safety-critical, i.e., below ASIL A, are referred to as 'QM' as no more than the normal quality management process is enforced.
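A minimal sketch of this risk classification (Python; it uses the common additive rule of thumb over the severity, exposure, and controllability classes rather than the normative ISO 26262 determination table):

    def asil(severity: int, exposure: int, controllability: int) -> str:
        """Map S1-S3, E1-E4, C1-C3 (passed as 1-3, 1-4, 1-3) to an ASIL.

        Simplified rule of thumb: the class numbers are summed, and only
        the highest sums yield an ASIL; everything below is 'QM'.
        """
        if not (1 <= severity <= 3 and 1 <= exposure <= 4 and 1 <= controllability <= 3):
            raise ValueError("class out of range")
        total = severity + exposure + controllability
        return {10: "D", 9: "C", 8: "B", 7: "A"}.get(total, "QM")

    print(asil(3, 4, 3))  # worst case: ASIL D
    print(asil(2, 2, 2))  # low-risk combination: QM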


2.2. Deep Learning for Perception: Approaches and Challenges

While deep learning is currently surrounded by hype, there is no doubt that the technique has produced groundbreaking results in various fields – by clever utilization of the increased processing power of the last decade, nowadays available in inexpensive GPUs, combined with the ever-increasing availability of data.

Deep learning is enabled by DNNs, which are a kind of Artificial Neural Network (ANN). To some extent inspired by biological connectomes, i.e., mappings of neural connections such as in the human brain, ANNs composed of connected layers of neurons are designed to learn to perform classification tasks. While ANNs have been studied for decades, significant breakthroughs came when the increased processing power allowed adding more and more layers of neurons – which also increased the number of connections between neurons by orders of magnitude. The exact number of layers needed for a DNN to qualify as deep, however, is debatable.
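A minimal sketch of such a layered network (assuming PyTorch; the layer sizes are illustrative):

    import torch.nn as nn

    # A small fully connected ANN: stacking more hidden layers like these
    # is what makes such a network "deep".
    ann = nn.Sequential(
        nn.Linear(784, 128), nn.ReLU(),  # hidden layer 1
        nn.Linear(128, 64), nn.ReLU(),   # hidden layer 2
        nn.Linear(64, 10),               # output layer: 10 classes
    )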

A major advantage of DNNs is that the classifier is less dependent on feature engineering, i.e., using domain knowledge to (perhaps manually) identify properties in data for ML to learn from – which is often difficult. Examples of operations used to extract features in computer vision include color analysis, edge extraction, shape matching, and texture analysis. What DNNs instead introduced was an ML solution that learns such features directly from input data, greatly decreasing the need for human feature engineering. DNNs have been particularly successful in speech recognition, computer vision, and text processing – areas in which ML results were previously limited by the tedious work required to extract effective features.

In computer vision, essential for vehicular perception, the state-of-the-art is represented by a special class of DNNs known as Convolutional Neural Networks (CNN)36,95,88,27. Since 2010, several approaches based on CNNs have been proposed – and in only five years of incremental research, the best CNNs matched the image classification accuracy of humans. CNN-based image recognition is now reaching the masses, as companies such as Nvidia and Intel are commercializing specialized hardware with automotive applications in mind, such as the Drive PX series. Success stories in the automotive domain include lane keeping applications for self-driving cars14,79.
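A minimal, illustrative CNN in the same vein (assuming PyTorch; the layer sizes are placeholders, not those of any architecture cited above):

    import torch.nn as nn

    # Convolution + pooling layers learn visual features; a linear layer
    # maps them to class scores.
    cnn = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * 16 * 16, 10),  # assumes 64x64 RGB input images
    )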

Generative Adversarial Networks (GAN) are another approach in deep learning research that is currently receiving considerable interest30,74. In contrast to the discriminative networks discussed so far, which learn boundaries between classes in the data for the purpose of classification, a generative network can instead be used to learn the probability of features given a specific class. Thus, a GAN could be used to generate samples from a learned network – which could possibly be used to expand available training data with additional synthetic data. GANs can also be used to generate adversarial examples, i.e., inputs to ML classifiers intentionally created to cause misclassification.
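A minimal, illustrative GAN training loop (assuming PyTorch; the toy 1-D data and tiny architectures are invented only to show the generator/discriminator interplay):

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))  # generator
    D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # discriminator
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCELoss()

    for step in range(500):
        real = torch.randn(64, 1) * 0.5 + 2.0  # "real" samples drawn from N(2, 0.5)
        fake = G(torch.randn(64, 8))           # generator maps noise to samples
        # Discriminator step: distinguish real from generated samples.
        d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # Generator step: produce samples the discriminator accepts as real.
        g_loss = bce(D(fake), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()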

Finally, successful applications of DNNs rely on the availability of large labeled datasets from which to learn features. In many cases, such labels are limited or do not exist at all. To maximize the utility of the labeled data – truly hard currency for anyone engineering ML-based systems – techniques such as transfer learning are used to adapt knowledge learned from one dataset to another domain29.

3. Research method

The overarching goal of the SMILE project is to develop approaches to V&V of ML-based systems, more specifically automotive applications relying on DNNs. Our current paper is guided by two research questions:

RQ1 What is the state-of-the-art in V&V of ML-based safety-critical systems?

RQ2 What are the main challenges when engineering safety-critical systems with DNN components in the automotive domain?

Fig. 1 shows an overview of the research, divided into three sequential parts (P1-P3). Each part concluded with a milestone (I-III). In Fig. 1, tasks driven by academia (or research institutes) are presented in the light gray area – primarily addressing RQ1. Tasks in the darker gray area above are primarily geared toward collecting data in the light of RQ2, and mostly involve industry practitioners. The darkest gray areas denote involvement of practitioners who were active in safety-critical development but not part of the SMILE project.

Fig. 1. Overview of the SMILE project and its three milestones. The figure illustrates the joint industry/academia nature of SMILE, indicated by light gray background for tasks driven by academia and darker gray for tasks conducted by practitioners.

In the first part of the project (P1 in Fig. 1), we initiated a systematic snowballing review of academic literature to map the state-of-the-art. In parallel, we organized a workshop series with domain experts from industry, with monthly meetings, to also assess the state-of-practice in the Swedish automotive industry. The literature review was seeded by discussions from the project definition phase (a). Later, we shared intermediate findings from the literature review at workshop #4 (b), and final results were brought up for discussion at workshop #6 (c). The first part of the project concluded with Milestone I: a collection of industry perspectives.

The second part of the SMILE project (P2 in Fig. 1) involved an analysis of the identified literature (d). We extracted challenges and solution proposals from the literature, and categorized them according to a structure that inductively emerged during the process (see Section 3.1). Subsequently, we created a questionnaire-based survey to validate our findings and to receive input from industry practitioners beyond SMILE (e). The second phase concluded with analyzing the survey data at Milestone II.


In the third part of the project (P3 in Fig. 1), we collected all results (f) and performed a synthesis (g). Finally, writing this article concludes the research at Milestone III.

Fig. 2 shows an overview of the SMILE project from an evidence perspective. The collection of empirical evidence was divided into two independent tracks resulting in four sets of evidence, reflecting the nature of the joint academia/industry project. Furthermore, the split enabled us to balance the trade-off between rigor and relevance that plagues applied research projects45.

As shown in the upper part of Figure 2, the SMILE consortium performed (non-replicable, from now on: "ad hoc") searching for related work. An early set of papers was used to seed the systematic search described in the next paragraph. The findings in the body of related work (cf. A in Fig. 2) were discussed at the workshops. The workshops served dual purposes: they collected empirical evidence of priorities and current needs in the Swedish automotive industry (cf. B in Fig. 2), and they validated the relevance of the research identified through the ad hoc literature search. The upper part focused on maximizing industrial relevance at the expense of rigor, i.e., we are certain that the findings are relevant to the Swedish automotive industry, but the research was conducted in an ad hoc fashion with limited traceability and replicability. The lower part of Figure 2 complements the practice-oriented research of the SMILE project with a systematic literature review, adhering to an established process103. The identified papers (cf. C in Fig. 2) were systematized and the result was validated through a questionnaire-based survey. The survey also acted as a means to collect additional primary evidence, as we collected practitioners' opinions on V&V of ML-based systems in safety-critical domains (cf. D in Fig. 2). Thus, the lower part focused on maximizing academic rigor.

Fig. 2. Overview of the SMILE project from an evidence perspective. We treat the evidence as four different sets: A. Related work and C. Snowballed literature represent secondary evidence, whereas B. Workshop findings and D. Survey responses constitute primary evidence.

3.1. The systematic review

Inspired by evidence-based medicine, systematic literature reviews have become a popular software engineering research method to aggregate work in a research area. Snowballing literature reviews103 are an alternative to more traditional database searches relying on carefully developed search strings, and are particularly suitable when the terminology used in the area is diverse, e.g., in early stages of new research topics. This section describes the two main phases of the literature review: 1) paper selection and 2) data extraction and analysis.

3.1.1. Paper selection

As safety-critical applications of DNNs in the automotive sector are still a new research topic, we decided to broaden our literature review to also encompass other types of ML, and to go beyond the automotive sector. We developed the following criteria: for a publication to be included in our literature review, it should describe 1) engineering of an ML-based system 2) in the context of autonomous cyber-physical systems, and 3) the paper should address V&V or safety analysis. Consequently, our criteria include ML beyond neural networks and DNNs. Our focus on autonomous cyber-physical systems implicitly restricts our scope to safety-critical systems. Finally, we exclude papers that do not target V&V or safety analysis, but instead other engineering considerations, e.g., requirements engineering, software architecture, or implementation issues.

First, we established a start set using exploratory searching in Google Scholar and applying our inclusion criteria. By combining various search terms related to ML, safety analysis, and V&V identified during the project definition phase of the workshop series (cf. a) in Fig. 1), we identified 14 papers representing a diversity of authors, publishers, and publication venues, i.e., adhering to recommendations for a feasible start set103. Still, the composition of the start set is a major threat to the validity of any snowballing literature review. Table 5 shows the papers in the start set.

Originating in the 14 papers in the start set, we iteratively conducted backward and forward snowballing. Backward snowballing means scanning the reference lists for additional papers to include. Forward snowballing from a paper involves adding related papers that cite the given paper. We refer to one combined effort of backward and forward snowballing as an iteration. In each iteration, two researchers collected candidates for inclusion and two other researchers validated the selection using the inclusion criteria. Despite our efforts to carefully process iterations, there is always a risk that relevant publications could not be identified by following references from our start set due to citation patterns in the body of scientific literature, e.g., research cliques.
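A minimal sketch of the iteration loop (plain Python; references(), citations(), and include() are hypothetical helpers representing reference-list scanning, citation lookup, and the inclusion criteria):

    def snowball(start_set, references, citations, include):
        selected = set(start_set)
        frontier = set(start_set)
        iteration = 0
        while frontier:
            iteration += 1
            candidates = set()
            for paper in frontier:
                candidates |= set(references(paper))  # backward snowballing
                candidates |= set(citations(paper))   # forward snowballing
            # Keep only new papers that satisfy the inclusion criteria.
            frontier = {p for p in candidates - selected if include(p)}
            selected |= frontier
            print(f"iteration {iteration}: {len(frontier)} new papers")
        return selected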

3.1.2. Data extraction and analysis

When the snowballing was completed, two authors extracted publication metadata according to a predefined extraction form, e.g., publication venue and application domain. Second, the same two authors conducted an assessment of rigor and relevance as recommended by Ivarsson and Gorschek45. Third, they addressed RQ1 using thematic analysis26, i.e., summarizing, integrating, combining, and comparing findings of primary studies to identify patterns.

Our initial plan was to classify challenges and solution proposals in previous work using classification schemes developed by Amodei et al. [P2] and Varshney101, respectively. However, neither of the two proposed categorization schemes was successful in spanning the content of the selected papers. To better characterize the selected body of research, we inductively created new classification schemes for challenges and solution proposals according to a grounded theory approach. Table 1 defines the final categories used in our study: seven challenge categories and five solution proposal categories.

3.2. The questionnaire-based survey

To validate the findings from the snowballed literature (cf. C. in Figure 2), we designed a web-based questionnaire to survey practitioners in safety-critical domains. Furthermore, reaching out to additional practitioners beyond the SMILE project enabled us to collect more insights into challenges related to ML-based systems in additional safety-critical contexts (cf. D. in Fig. 2). Moreover, we used the survey to let the practitioners rate the importance of the challenges reported in the academic literature, as well as the perceived feasibility of the published solution proposals.

We designed the survey instrument using Google Forms, structured as 10 questions organized into two sections. The first section consisted of seven closed-ended questions related to demographics of the respondents and their organizations, and three Likert items concerning high-level statements on V&V of ML-based systems. The second section consisted of three questions: 1) rating the importance of the challenge categories, 2) rating how promising the solution proposal categories are, and 3) an open-ended free-text answer requesting a comment on our main findings and possibly adding missing aspects.

We opted for an inclusive approach and used convenience sampling to collect responses76, i.e., a non-probabilistic sampling method. The target population was software and systems engineering practitioners working in safety-critical contexts, including both engineering and managerial roles, e.g., test managers, developers, architects, safety engineers, and product managers. The main recruitment strategy was to invite the extended SMILE network (cf. workshops #5 and #6 in Fig. 1) and to advertise the survey invitation in LinkedIn groups related to development of safety-critical systems.


Table 1: Definition of categories of challenges and solution proposals for V&V of ML-based systems.

Challenge Categories: Definitions
State-space explosion: Challenges related to the very large size of the input space.
Robustness: Issues related to operation in the presence of invalid inputs or stressful environmental conditions.
Systems engineering: Challenges related to integration or co-engineering of ML-based and conventional components.
Transparency: Challenges originating in the black-box nature of the ML system.
Requirements specification: Problems related to specifying expectations on the learning behavior.
Test specification: Issues related to designing test cases for ML-based systems, e.g., non-deterministic output.
Adversarial attacks: Threats related to antagonistic attacks on ML-based systems, e.g., adversarial examples.

Solution Proposal Categories: Definitions
Formal methods: Approaches to mathematically prove that some specification holds.
Control theory: Verification of learning behavior based on automatic control and self-adaptive systems.
Probabilistic methods: Statistical approaches such as uncertainty calculation, Bayesian analysis, and confidence intervals.
Test case design: Approaches to create effective test cases, e.g., using genetic algorithms or procedural generation.
Process guidelines: Guidelines supporting work processes, e.g., covering training data collection or testing strategies.


We collected answers in 2017, from July 1 to August 31.

As a first step of the response analysis, we performed a content sanity check to identify invalid answers, e.g., nonsense or careless responses. Subsequently, we collected summary statistics of the responses and visualized them with bar charts to get a quick overview of the data. We calculated Spearman rank correlations (ρ) between all ordinal scale responses, interpreting correlations as weak, moderate, and strong for ρ > 0.3, ρ > 0.5, and ρ > 0.7, respectively. Finally, the two open-ended questions were coded, summarized, and validated by four of the co-authors.
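A minimal sketch of this correlation analysis (assuming SciPy; the two ordinal response vectors are invented for illustration):

    from scipy.stats import spearmanr

    x = [5, 4, 4, 2, 3, 5, 1, 4]  # responses to one Likert item
    y = [4, 5, 3, 2, 3, 4, 2, 5]  # responses to another Likert item
    rho, p_value = spearmanr(x, y)

    def strength(rho):
        # Thresholds from the text: weak > 0.3, moderate > 0.5, strong > 0.7.
        a = abs(rho)
        return ("strong" if a > 0.7 else
                "moderate" if a > 0.5 else
                "weak" if a > 0.3 else "negligible")

    print(f"rho = {rho:.2f} ({strength(rho)}), p = {p_value:.3f}")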

4. Results and Discussion

This section is organized according to the evidence perspective provided in Fig. 2: A. Related work, B. Workshop findings, C. Snowballed literature, and D. Survey responses. As reported in Section 3, A. and B. focus on industrial relevance, whereas C. and D. aim at academic rigor.

4.1. Related work

The related work section (cf. A. in Fig. 2) presents an overview of literature that was identified during the SMILE project. Fourteen of the papers were selected early to seed the (independent) snowballing literature review described in Section 3.1. In this section, we first describe the start set [P1]-[P14], and then papers that were subsequently identified by SMILE members or the anonymous reviewers of the manuscript – but not through the snowballing process (as those are reported separately in Section 4.3).

4.1.1. The snowballing start set

The following 14 papers were selected as the snowballing start set, representing a diverse set of authors, publication venues, and publication years. We briefly describe them below and motivate their inclusion in the start set.

[P1] Clark et al. reported from a US Air Force research project on challenges in V&V of autonomous systems. This work is highly related to the SMILE project.

[P2] Amodei et al. listed five challenges to artificial intelligence safety according to Google Brain: 1) avoiding negative side effects, 2) avoiding reward hacking, 3) scalable oversight, 4) safe exploration, and 5) robustness to distributional shift.

[P3] Brat and Jonsson discussed challenges in V&V of autonomous systems engineered for space exploration. Included to cover the space domain.

[P4] Broggi et al. presented extensive testing of the BRAiVE autonomous vehicle prototype by driving from Italy to China. Included as it stands out, i.e., it reports experiences from a practical road trip.

[P5] Taylor et al. sampled research in progress (in 2003) on V&V of neural networks, aimed at NASA applications. Included to snowball research conducted in the beginning of the millennium.

[P6] Taylor et al. with the Machine Intelligence Research Institute surveyed design principles that could ensure that systems behave in line with the interests of their operators – which they refer to as "AI alignment". Included to bring in a more philosophical perspective on safety.

[P7] Carvalho et al. presented a decade of research on control design methods for systematic handling of uncertain forecasts for autonomous vehicles. Included to cover robotics.

[P8] Ramos et al. proposed a DNN-based obstacle detection framework, providing sensor fusion for detection of small road hazards. Included as the work closely resembles the use case discussed at the workshops (see Section 4.2).

[P9] Alexander et al. suggested "situation coverage methods" for autonomous robots to support testing of all environmental circumstances. Included to cover coverage-based approaches.

[P10] Zou et al. discussed safety assessments of probabilistic airborne collision avoidance systems and proposed a genetic algorithm to search for undesired situations. Included to cover probabilistic approaches.

[P11] Zou et al. presented a safety validation approach for avoidance systems in unmanned aerial vehicles, using evolutionary search to guide simulations to potential conflict situations in large state spaces. Although the authors overlap, included to snowball research on simulation.

[P12] Arnold and Alexander proposed using procedural content generation to create challenging environmental situations when testing autonomous robot control algorithms in simulations. Included to cover synthetic test data.

[P13] Sivaraman and Trivedi compared three active learning approaches for on-road vehicle detection. Included to add a semi-supervised ML approach.

[P14] Mozaffari et al. developed a robust safety-oriented autonomous cruise controller based on the model predictive control technique. Included to identify approaches based on control theory.

In the start set, we consider [P1] to be the research endeavor closest to our current study. While we target the automotive domain rather than aerospace, both studies address highly similar research objectives – and the method used to explore the topic is also close to our approach. [P1] describes a year-long study aimed at: 1) understanding the unique challenges to the certification of safety-critical autonomous systems and 2) identifying the V&V approaches needed to overcome them. To accomplish this, the US Air Force organized three workshops with representatives from industry, academia, and governmental agencies, respectively. [P1] concludes that there are four enduring problems that must be addressed:

• State-Space Explosion – In an autonomous system, the decision space is non-deterministic and the system might be continuously learning. Thus, over time, there may be several output signals for each input signal. This in turn makes it inherently challenging to exhaustively search, examine, and test the entire decision space.

• Unpredictable Environments – Conventional systems have limited ability to adapt to unanticipated events, but an autonomous system should respond to situations that were not programmed at design time. However, there is a trade-off between performance and correct behavior, which exacerbates the state-space explosion problem.

• Emergent Behavior – Non-deterministic and adaptive systems may induce behavior that results in unintended consequences. Challenges comprise how to understand all intended and unintended behavior and how to design experiments and test vectors that are applicable to adaptive decision making in an unpredictable environment.

• Human-Machine Communication – Hand-off, communication, and cooperation between the operator and the autonomous system play an important role in creating mutual trust between the human and the system. It is not known how to address these issues when the behavior is not known at design time.

With these enduring challenges in mind, [P1] calls for research to pursue five goals in future technology development. First, approaches to cumulatively build safety evidence through the phases of Research & Development (R&D), Test & Evaluation (T&E), and Operational Tests; the US Air Force calls for effective methods to reuse safety evidence throughout the entire product development lifecycle. Second, [P1] argues that formal methods, embedded during R&D, could provide safety assurance. This approach could reduce the need for T&E and operational tests. Third, novel techniques to specify requirements based on formalism, mathematics, and rigorous natural language could bring clarity and allow automatic test case generation and automated traceability to low-level designs. Fourth, run-time decision assurance may allow restraining the behavior of the system, thus shifting focus from off-line verification to performing on-line testing at run-time. Fifth, [P1] calls for research on compositional case generation, i.e., better approaches to combine different pieces of evidence into one compelling safety case.

4.1.2. Non-snowballed related work

This subsection reports the related work that stirred up the most interesting discussions in the SMILE project. In contrast to the snowballing literature review, we do not provide steps to replicate the identification of the following papers.

Knauss et al. conducted an exploratory interview study to elicit challenges when engineering autonomous cars51. Based on interviews and focus groups with 26 domain experts in five countries, the authors report in particular challenges in testing automated vehicles. Major challenges are related to: 1) virtual testing and simulation, 2) safety, reliability, and quality, 3) sensors and their models, 4) complexity of, and amount of, test cases, and 5) hand-off between driver and vehicle.

Spanfelner et al. conducted research on safety and autonomy in the ISO 26262 context93. Their conclusion is that driver assistance systems need models to be able to interpret the surrounding environment, i.e., to enable vehicular perception. Since models, by definition, are simplifications of the real world, they will be subject to functional insufficiencies. By accepting that such insufficiencies may fail to reach the functional safety goals, it is possible to design additional measures that in turn can meet the safety goals.

Heckemann et al. identified two primary challenges in developing autonomous vehicles adhering to ISO 2626237. First, the driver is today considered to be part of the safety concept, but future vehicles will make driving maneuvers without interventions by a human driver. Second, the system complexity of modern vehicle systems is continuously growing as new functionality is added. This obstructs safety assessment, as increased complexity makes it harder to verify freedom from faults.

Varshney discussed concepts related to engineering safety for ML systems from the perspective of minimizing risk and epistemic uncertainty101, i.e., uncertainty due to gaps in knowledge as opposed to intrinsic variability in the products. More specifically, he analyzed how four general strategies for promoting safety64 apply to systems with ML components. First, inherently safe design means excluding a potential hazard from the system instead of controlling it. A prerequisite for assuring such a design is to improve the interpretability of the typically opaque ML models. Second, safety reserves refers to the factor of safety, e.g., the ratio of absolute structural capacity to actual applied load in structural engineering. In ML, interpretations include a focus on the maximum error of classifiers instead of the average error, or training models to be robust to adversarial examples. Third, safe fail implies that a system remains safe even when it fails in its intended operation, traditionally by relying on constructs such as electrical fuses and safety valves. In ML, a concept of run-time monitoring must be accomplished, e.g., by continuously monitoring how certain a DNN model is in its classification task. Fourth, procedural safeguards cover any safety measures that are not designed into the system, e.g., mandatory safety audits, training of personnel, and user manuals describing how to define the training set.

Seshia et al. identified five major challenges to achieving formally-verified AI-based systems86. First, a methodology to provide a model of the environment even in the presence of uncertainty. Second, a precise mathematical formulation of what the system is supposed to do, i.e., a formal specification. Third, the need to come up with new techniques to formally model the different components that use machine learning. Fourth, systematically generating training and testing data for ML-based components. Finally, developing computationally scalable engines that are able to quantitatively verify the requirements of a system.

One approach to tackle the opaqueness of DNNs is to use visualization. Bojarski et al.13 developed a tool for visualizing the parts of an image that are used for decision making in vehicular perception. Their tool demonstrated an end-to-end driving application where the input is images and the output is the steering angle. Mhamdi et al. also studied the black-box aspects of neural networks, and showed that the robustness of a complete DNN can be assessed by an analysis focused on individual neurons as units of failure62 – a much more reasonable approach given the state-space explosion.

In a paper on ensemble learning, Varshney et al. described a reject option for classifiers102. Such a classifier could, instead of presenting a highly uncertain classification, request that a human operator intervene. A common assumption is that the classifier is the least confident in the vicinity of the decision boundary, i.e., that there is an inverse relationship between distance and confidence. While this might be true in some parts of the feature space, it is not a reliable measure in parts that contain too few training examples. For a reject option to provide a "safe fail" strategy, it must trigger both 1) near the decision boundary in parts of the feature space with many training examples, and 2) in any decision represented by too few training examples.
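A minimal sketch of such a dual-trigger reject option (assuming scikit-learn; the data, classifier, and thresholds are invented for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    clf = LogisticRegression().fit(X, y)
    density = NearestNeighbors(n_neighbors=10).fit(X)  # proxy for training density

    def classify_with_reject(x, conf_min=0.8, dist_max=0.8):
        proba = clf.predict_proba(x.reshape(1, -1))[0]
        dist, _ = density.kneighbors(x.reshape(1, -1))
        near_boundary = proba.max() < conf_min  # trigger 1: low confidence
        sparse_region = dist.mean() > dist_max  # trigger 2: too few training examples
        if near_boundary or sparse_region:
            return "REJECT: hand over to human operator"
        return int(proba.argmax())

    print(classify_with_reject(np.array([0.05, -0.02])))  # near boundary: reject
    print(classify_with_reject(np.array([8.0, 8.0])))     # far from training data: reject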

Heckemann et al. proposed using the concept of adaptive safety cage architectures to support future autonomy in the automotive domain37, i.e., an independent safety mechanism that continuously monitors sensor input. The authors separated two areas of operation: a valid area (that is considered safe) and an invalid area that can lead to hazardous situations. If the function is about to enter the invalid area, the safety cage will invoke an appropriate safe action, such as a minimum risk emergency stopping maneuver or a graceful degradation. Heckemann et al. argued that a safety cage can be used in an ASIL decomposition by acting as a functionally redundant system to the actual control system. The highly complex control function could then be developed according to the quality management standard, whereas the comparably simple safety cage could adhere to a higher ASIL level.

Adler et al. presented a similar run-time monitoring mechanism for detecting malfunctions, referred to as a safety supervisor1. Their safety supervisor is part of an overall safety approach for autonomous vehicles, consisting of a structured four-step method to identify the most critical combinations of behaviors and situations. Once the critical combinations have been specified, the authors propose implementing tailored safety supervisors to safeguard against related malfunctions.

Finally, a technical report prepared by Bhattacharyya et al. for the NASA Langley Research Center discussed certification considerations for adaptive systems in the aerospace domain10. The report separates adaptive control algorithms from Artificial Intelligence (AI) algorithms, and the latter is closely related to our study since it covers machine learning and ANNs. Their certification challenges for adaptive systems are organized into four categories:

• Comprehensive requirements – Specifying a set of requirements that completely describes the behavior, as mandated by current safety standards, is presented as the most difficult challenge to tackle.

• Verifiable requirements – Specifying pass criteria for test cases at design-time might be hard. Also, current aerospace V&V relies heavily on coverage testing of source code in imperative languages, but how to interpret that for AI algorithms is unclear.

• Documented design – Certification requires detailed documentation, but components realizing adaptive algorithms are rarely developed with this in mind. AI algorithms in particular are often distributively developed by open source communities, which makes it hard to reverse engineer documentation and traceability.

• Transparent design – Regulators expect a transparent design and a conventional implementation to be presented for evaluation. Increasing system complexity by introducing novel adaptive algorithms challenges comprehensibility and trust. On top of that, adaptive systems are often non-deterministic, which makes it harder to demonstrate the absence of unintended functionality.

4.2. The Workshop Series

During the six workshops with industry partners (cf. #1-#6 in Fig. 1), we discussed key questions that must be explored to enable engineering of safety-critical automotive systems with DNNs. Three subareas emerged during the workshops: 1) robustness, 2) interplay between DNN components and conventional software, and 3) V&V of DNN components.

4.2.1. Robustness of DNN Components

The concept of robustness permeated most discussions during the workshops. While robustness is technically well-defined, in the workshops it often remained a rather elusive quality attribute – typically translated to "something you can trust".

To bring the workshop participants onto the same page, we found it useful to base the discussions on a simple ML case: a confusion matrix for a one-class classifier for camera-based animal detection. For each input image, the result of the classifier is limited to one of four options: 1) an animal is present and correctly classified (true positive), 2) no animal is present and the classifier does not signal animal detection (true negative), 3) the classifier reports animal presence, but there is none (false positive), and 4) an animal is present, but the classifier misses it (false negative).

For the classifier to be considered robust, the participants stressed the importance of not generating false positives and false negatives despite occasional low-quality input or changes in the environmental conditions, e.g., dusk, rain, or sun glare. A robust ML system should neither miss present animals, risking collisions, nor suggest emergency braking that risks rear-end collisions. As the importance of robustness in the example is obvious, we see a need for future research both on how to specify and how to verify acceptable levels of ML robustness.
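A minimal sketch of the bookkeeping for this example (plain Python; the labels and predictions are invented):

    def confusion(y_true, y_pred):
        tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
        tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
        fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # risk: needless emergency braking
        fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # risk: collision with animal
        return tp, tn, fp, fn

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = animal present
    y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # classifier output
    tp, tn, fp, fn = confusion(y_true, y_pred)
    print(f"recall    = {tp / (tp + fn):.2f}")  # missed animals lower this
    print(f"precision = {tp / (tp + fp):.2f}")  # false alarms lower this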

During the workshops, we also discussed more technical aspects of engineering robust DNN components. First, our industry practitioners brought up the issue that DNN architectures are problem-specific. While there are some approaches to automatically generating neural network architectures6,104, designing the DNN architecture is typically an ad hoc process of trial and error. Often a well-known architecture is used as a baseline and then tuned to fit the problem at hand. Our workshops recognized the challenge of engineering robust DNN-based systems, in part due to their highly problem-specific architectures.

Second, once the DNN architecture is set, training commences to assign weights to the trainable parameters of the network. The selection of training data must be representative for the task, in our discussions animal detection, and for the environment that the system will operate in. The workshops agreed that robustness of DNN components can never be achieved without careful selection of training data. Not only must the amount and quality of sensors (in our case cameras) acquiring the different stimuli for the training data be sufficient, other factors such as positioning, orientation, aperture, and even geographical location like city and country must also match the animal detection example. At the workshops, we emphasized the issue of camera positions, as both car and truck manufacturers were part of SMILE – to what extent can training data from a car's perspective be reused for a truck? Or should a truck rather benefit from its size and collect dedicated training data from its elevated camera position?

Third, also related to training data, the workshops discussed working with synthetic data. While such data can always be used to complement training data, there are several open questions on how best to come up with the right mix during the training stage. As reported in Section 2.2, GANs30,74 could be a good tool for synthesizing data. Sixt et al.90 proposed a framework called RenderGAN that can generate large amounts of realistic labeled data for training. In transfer learning, training efficiency improves by combining data from different data sets29,69. One possible approach could be to first train the DNN component using synthetic data from, e.g., simulators like TORCS§, then continue the training with data from some publicly available database, e.g., the KITTI data¶ or CityScape‖, and finally add data from the geographical region where the vehicle should operate. For any attempts at transfer learning, the workshops identified the need to measure to what extent training data matches the planned operational environment.
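A minimal sketch of such staged training (assuming PyTorch and torchvision; dummy_loader is a placeholder standing in for the synthetic, public, and regional datasets, and the decreasing learning rates are illustrative):

    import torch
    import torchvision

    model = torchvision.models.resnet18(num_classes=2)  # e.g., animal / no animal

    def dummy_loader(n=32):
        # Placeholder: replace with loaders for TORCS-style synthetic data,
        # a public dataset such as KITTI, or region-specific recordings.
        x = torch.randn(n, 3, 64, 64)
        y = torch.randint(0, 2, (n,))
        return torch.utils.data.DataLoader(
            torch.utils.data.TensorDataset(x, y), batch_size=8)

    def fine_tune(model, loader, epochs, lr):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(epochs):
            for images, labels in loader:
                opt.zero_grad()
                loss_fn(model(images), labels).backward()
                opt.step()

    # Stage 1: synthetic, stage 2: public, stage 3: regional data, with
    # smaller learning rates so later stages refine rather than overwrite.
    for stage_loader, lr in [(dummy_loader(), 1e-3), (dummy_loader(), 1e-4), (dummy_loader(), 1e-5)]:
        fine_tune(model, stage_loader, epochs=1, lr=lr)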

4.2.2. Complementing DNNs with Conventional Components

During the workshops, we repeatedly reminded the participants to consider DNNs from a systems perspective. DNN components will always be part of an automotive system that also consists of conventional hardware and software components.

Several researchers claim that DNN components are a prerequisite for autonomous driving28,2,79. However, how to integrate such components in a system is an open question. Safety is a systems issue, rather than a component-specific issue. All hazards introduced by both DNNs and conventional software must be analyzed within the context of systems engineering principles. On the other hand, the hazards can also be addressed on a system level.

§ http://torcs.sourceforge.net
¶ http://www.cvlibs.net/datasets/kitti/
‖ https://www.cityscapes-dataset.com/

One approach to achieve DNN safety is to introduce complementary components, i.e., when a DNN model fails to generalize, a conventional software or hardware component might step in to maintain safe operation. During the workshops, particular attention was given to introducing a safety cage concept. Our discussions orbited a solution in which the DNN component is encapsulated by a supervisor, or a safety cage, that continuously monitors the input to the DNN component. The envisioned safety cage should perform novelty detection70 and alert when input does not belong within the training region of the DNN component, i.e., if the risk of failed generalization is too high, the safety cage should redirect the execution to a safe-track. The safe-track should then operate without any ML components involved, enabling traditional approaches to safety-critical software engineering.
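A minimal sketch of the envisioned safety cage (assuming scikit-learn; the IsolationForest novelty detector, dnn_predict, and safe_track are illustrative placeholders, not components from any cited work):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Features representing the training region of the DNN component.
    X_train = np.random.default_rng(1).normal(size=(1000, 16))
    novelty_detector = IsolationForest(random_state=1).fit(X_train)

    def dnn_predict(x):
        return "DNN output"           # placeholder for the encapsulated DNN

    def safe_track(x):
        return "minimum-risk action"  # placeholder for the ML-free fallback

    def safety_cage(x):
        # -1 flags input outside the training region: hand off to the safe-track.
        if novelty_detector.predict(x.reshape(1, -1))[0] == -1:
            return safe_track(x)
        return dnn_predict(x)

    print(safety_cage(np.zeros(16)))       # in-distribution: DNN output
    print(safety_cage(np.full(16, 10.0)))  # novel input: minimum-risk action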

The concept of an ML safety cage is in line with Varshney's discussion of "safe fail"101. Different options to implement an ML safety cage include adaptations of fail-silent systems17, plausibility checks52, and arbitration. However, Adler et al.1 indicated that the no free lunch theorem might apply to safety cages: if tailored safety cages are to be developed to safeguard against domain-specific malfunctions, then different safety cages may be required for different systems.

Introducing redundancy in the ML system is an approach related to the safe track. One method is to use ensemble methods in computer vision applications61, i.e., employing multiple learning algorithms to improve predictive performance. Redundancy can also be introduced in an ML-based system using hardware components, e.g., using an array of sensors of the same, or different, kind. Increasing the amount of input data should increase the probability of finding patterns closer to the training data set. Combining data from various input sources, referred to as sensor fusion, also helps overcome the potential deficiencies of individual sensors.
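A minimal sketch of redundancy through majority voting over ensemble members (plain Python; the three classifier outputs are invented):

    from collections import Counter

    def majority_vote(predictions):
        """Return the most common class among ensemble members."""
        label, _ = Counter(predictions).most_common(1)[0]
        return label

    # E.g., front camera, side camera, and lidar-based classifiers disagree:
    print(majority_vote(["animal", "animal", "no_animal"]))  # -> animal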

4.2.3. V&V Approaches for Systems with DNN Components

Developing approaches to engineer robust systems with DNN components is not enough; the automotive industry must also develop novel approaches to V&V. V&V is a cornerstone of safety certification, but it still remains unclear how to develop a safety case around applications with DNNs.

As pointed out in previous work, the current ISO 26262 standard is not applicable when developing autonomous systems that rely on DNNs37. Our workshops corroborate this view by identifying several open questions that need to be better understood:

• How is a DNN component classified in ISO 26262? Should it be regarded as an individual software unit or a component?

• From a safety perspective, is it possible to treat DNN misclassifications as "hardware failures"? If yes, are the hardware failure target values defined in ISO 26262 applicable?

• ISO 26262 mandates complete test coverage of the software, but what does this imply for a DNN? What is sufficient coverage for a DNN?

• What metrics should be used to specify the DNN accuracy? Should quality targets using such metrics be used in the DNN requirements specifications, and subsequently as targets for verification activities?

Apart from the open questions, our workshop participants identified several aspects that would support V&V. First, as requirements engineering is fundamental to high-quality V&V12, some workshop participants requested a formal, or semi-formal, notation for requirements related to functional safety in the DNN context. Defining low-level requirements that would be verifiable appears to be one of the greatest challenges in this area. Second, there is a need for a tool-chain and framework tailored to lifecycle management of systems with DNN components – current solutions tailored for human-readable source code are not feasible and must be complemented with too many immature internal tools. Third, methods for test case generation for DNNs will be critical, as manual creation of test data does not scale.

Finally, a major theme during the workshops was how to best use simulation as a means to support V&V. We believe that the future will require massive use of simulation to ensure safe DNN components. Consequently, there is a need to develop simulation strategies that cover both normal circumstances and rare, but dangerous, traffic situations. Furthermore, simulation might also be used to assess sensitivity to adversarial examples.

4.3. The systematic snowballing

Table 5 shows the results from the five iterations of the snowballing. In total, the snowballing procedure identified 64 papers including the start set. We notice two publication peaks: 29 papers were published between 2002-2007 and 25 papers were published between 2013-2016. The former set of papers was dominated by research on using neural networks for adaptive flight controllers, whereas the latter set predominantly addresses the automotive domain. This finding suggests that organizations currently developing ML-based systems for self-driving cars could learn from similar endeavors in the aerospace domain roughly a decade ago – while DNNs were not available then, several aspects of V&V enforced by aerospace safety standards are similar to ISO 26262. Note, however, that 19 of the papers do not target any specific domain, but rather discuss ML-based systems in general.

Table 2 shows the distribution of challenge and solution proposal categories identified in the papers; '#' indicates the number of unique challenges or solution proposals matching a specific category. As each paper can report more than one challenge or solution proposal, and the same challenge or solution proposal can occur in more than one paper, the number of Paper IDs in the third column does not necessarily match the '#'. The challenges most frequently mentioned in the papers relate to state-space explosion and robustness, whereas the most commonly proposed solutions constitute approaches that belong to formal methods, control theory, or probabilistic methods.

Regarding the publication years, we notice that the discussion on state-space explosion has primarily been active in recent years, possibly explained by the increasing application of DNNs. Looking at solution proposals, we see that probabilistic methods were particularly popular during the first publication peak, and that research specifically addressing test case design for ML-based systems has appeared only after 2012.

Fig. 3 shows a mapping between solution proposal categories and challenge categories. Some of the papers propose a solution to address challenges belonging to a specific category. For each such instance, we connect solution proposals (to the left) and challenges (to the right), i.e., the width of the connection illustrates the number of instances. Note that we put the solution proposal in [P4] (deployment in a real operational setting) in its own 'Other' category. None of the proposed solutions address challenges related to the categories "Requirements specification" or "Systems engineering", indicating a research gap. Furthermore, "Transparency" is the challenge category that has been addressed the most in the papers, followed by "State-space explosion".

Fig. 3. Mapping between categories of solution proposals (to the left) and challenges (to the right).

Two books summarize most findings from the aerospace domain identified through our systematic snowballing. Taylor edited a book in 2006 that collected experiences for V&V of ANN technology96 in a project sponsored by the NASA Goddard Space Flight Center. Taylor concluded that the V&V techniques available at the time must evolve to tackle ANNs.

∗∗ The best practices were also later distilled into a guidance document intended for practitioners73.


Table 2. Distribution of challenge and solution proposal categories.

Challenge category         | # | Paper IDs
State-space explosion      | 6 | [P3], [P15], [P16], [P47]
Robustness                 | 4 | [P1], [P2], [P15], [P55]
Systems engineering        | 2 | [P1], [P55]
Transparency               | 2 | [P1], [P55]
Requirements specification | 3 | [P15], [P55]
Test specification         | 3 | [P16], [P46], [P55]
Adversarial attacks        | 1 | [P15]

Solution proposal category | # | Paper IDs
Formal methods             | 8 | [P3], [P26], [P42], [P28], [P37], [P40], [P44], [P53]
Control theory             | 7 | [P7], [P20], [P25], [P64], [P36], [P47], [P57], [P60]
Probabilistic methods      | 7 | [P18], [P30], [P31], [P32], [P33], [P35], [P50], [P52], [P54]
Test case design           | 5 | [P9], [P10], [P12], [P17], [P21]
Process guidelines         | 4 | [P23], [P51], [P56], [P59]

Taylor’s book reports five areas that need to be augmented to allow V&V of ANN-based systems:∗∗

• Configuration management must track all additional design elements, e.g., the training data, the network architecture, and the learning algorithms. Any V&V activity must carefully specify the configuration under test (see the sketch after this list).

• Requirements need to specify novel adaptive behavior, including control requirements (how to acquire and act on knowledge) and knowledge requirements (what knowledge should be acquired).

• Design specifications must capture design choices related to novel design elements such as training data, network architecture, and activation functions. V&V of the ANN design should ensure that the choices are appropriate.

• Development lifecycles for ANNs are highly iterative and last until some quantitative goal has been reached. Traditional waterfall software development is not feasible, and V&V must be an integral part rather than an add-on.

• Testing needs to evolve to address novel requirements. Structure testing should determine whether the network architecture is better at learning according to the control requirements than alternative architectures. Knowledge testing should verify that the ANN has learned what was specified in the knowledge requirements.
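As a minimal sketch of the configuration management point above, the record below captures the additional design elements that identify a trained model as a configuration under test; all field names are hypothetical.

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class MLConfiguration:
    """The configuration under test for a DNN component: source code
    versioning alone does not identify a trained model."""
    training_data_id: str       # e.g., a content hash of the dataset snapshot
    network_architecture: str   # e.g., "lane-detection-cnn-v3"
    learning_algorithm: str     # e.g., "SGD(lr=0.01, momentum=0.9)"
    random_seed: int
    trained_weights_sha256: str

    def fingerprint(self) -> str:
        """Stable identifier that every V&V report should reference."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:16]
```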

The second book that has collected experiences on V&V of (mostly aerospace) ANNs, also funded by NASA, was edited by Schumann and Liu and published in 201083. While the book primarily surveys the use of ANNs in high-assurance systems, parts of the discussion are focused on V&V – and the overall conclusion that V&V must evolve to handle ANNs is corroborated. In contrast to the organization we report in Table 2, the book suggests grouping solution proposals into approaches that: 1) separate ANN algorithms from conventional source code, 2) analyze the network architecture, 3) consider ANNs as function approximators, 4) tackle the opaqueness of ANNs, 5) assess the characteristics of the learning algorithm, 6) analyze the selection and quality of training data, and 7) provide means for online monitoring of ANNs. We believe that our organization is largely orthogonal to the list above, thus both could be used in a complementary fashion.

4.4. The survey

This section organizes the findings from the survey into closed questions, correlation analysis, and open questions, respectively.

4.4.1. Closed questions

Forty-nine practitioners answered our survey. Twenty respondents (40.8%) work primarily in the automotive domain, followed by 14 in aerospace (28.6%). Other represented domains include the process industry (5 respondents), railway (5 respondents), and government/military (3 respondents). The respondents represent a variety of roles, from system architects (17 out of 49, 34.7%) to product developers (10 out of 49, 20.4%) and managerial roles (7 out of 49, 14.3%). Most respondents primarily work in Europe (38 out of 49, 77.6%) or North America (7 out of 49, 14.3%).

Most respondents have some proficiency in ML. Twenty-five respondents (51.0%) report having fundamental awareness of ML concepts and practical ML concerns. Sixteen respondents (32.7%) have higher proficiency, i.e., they can implement ML solutions independently or with guidance – but no respondents consider themselves ML experts. On the other side of the spectrum, eight respondents report possessing no ML knowledge.

We used three Likert items to assess the respondents’ general thoughts about ML and functional safety, reported as a)-c) in Table 3. Most respondents agree (or strongly agree) that applying ML in safety-critical applications will be important in their organizations in the future (29 out of 49, 59.2%), whereas eight (16.3%) disagree. At the same time, 29 out of 49 (59.2%) of the respondents report that V&V of ML-based features is considered particularly difficult by their organizations – 20 respondents even strongly agree with the statement. It is clear to our respondents that more attention is needed regarding V&V of ML-based systems, as only 10 out of 49 (20.4%) believe that their organizations are well-prepared for the emerging paradigm.

Robustness (cf. e) in Table 3) stands out as the most important challenge, reported as “extremely important” by 29 out of 49 (59.2%). However, all challenges covered in the questionnaire were considered important by the respondents. The only challenge that appears less urgent to the respondents is adversarial attacks, but the difference is minor.

The respondents consider simulated test cases the most promising solution proposal to tackle challenges in V&V of ML-based systems, reported as extremely promising by 18 out of 49 respondents (36.7%) and moderately promising by 12 respondents (24.5%). Probabilistic methods is the least promising solution proposal according to the respondents, followed by process guidelines.

4.4.2. Correlation analysis

We identified some noteworthy correlations in the responses. The respondents’ ML proficiency (Q4) is moderately correlated (ρ = 0.53) with the perception of ML importance (Q5) – an expected finding, as respondents with a personal investment are likely to be biased. More interestingly, we found that ML proficiency was also moderately correlated with two of the seven challenge categories: transparency (ρ = 0.61) and state-space explosion (ρ = 0.54). This suggests that these two challenges are particularly difficult to comprehend for non-experts. Perceiving the organization as well-prepared for introducing ML-based solutions (Q4) is moderately correlated (ρ = 0.57) with considering systems engineering challenges (Q7) particularly important, and weakly correlated (ρ = 0.37) with regarding process guidelines (Q16) as a promising solution. As these are the only correlations with Q4, it indicates that organizations that have reached a certain ML maturity have progressed beyond specific issues and instead focus on the bigger picture, i.e., how to incorporate ML in systems and how to adapt internal processes in the new ML era.
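Correlations of this kind over ordinal Likert items are typically computed with Spearman's rank correlation; the sketch below shows the computation on fabricated example responses (the actual survey data is not reproduced here).

```python
import numpy as np
from scipy.stats import spearmanr

# Fabricated Likert responses (1-5) for two survey questions.
q_ml_proficiency = np.array([3, 4, 2, 5, 1, 3, 4, 2, 5, 3])
q_ml_importance = np.array([4, 4, 2, 5, 2, 3, 5, 1, 4, 3])

# Spearman's rank correlation is suitable for ordinal Likert data.
rho, p_value = spearmanr(q_ml_proficiency, q_ml_importance)
print(f"rho = {rho:.2f} (p = {p_value:.3f})")
```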

There are more correlations within the categories of challenges (Q5-Q11) and solution proposals (Q12-Q16) than between the two groups. The only strong correlation between the groups is between test specification (Q11) and formal methods (Q12) (ρ = 0.71). Within the challenges, the correlation between state-space explosion (Q5) and transparency (Q8) stands out as particularly strong (ρ = 0.91), illustrating the close connection between these two issues for large DNN architectures. Also, the two challenge categories requirements specification (Q9) and test specification (Q11) are strongly correlated (ρ = 0.71), in line with a large body of previous work on aligning the two concepts12.

Table 3. Answers to the closed questions of the survey. a)-c) show three Likert items, ranging from strongly disagree (1) to strongly agree (5). d)-o) report on importance/promisingness using the following ordinal scale: not at all, slightly, somewhat, moderately, and extremely. The ‘Missing’ column includes both “I don’t know” answers and missing answers.

4.4.3. Open questions

The end of the questionnaire contained an open-ended question (Q17), requesting a comment on Fig. 3 and the accompanying findings: “although few individual V&V challenges related to machine learning transparency are highlighted in the literature, it is the challenge most often addressed by the previous publications’ solution proposals. We also find that the second most addressed challenge in previous work is related to state-space explosion.”

Sixteen out of 49 respondents (32.7%) provided a free-text answer to Q17, representing highly contrasting viewpoints. Eight respondents reported that the findings were not in line with their expectations, whereas seven respondents agreed – one respondent neither agreed nor disagreed. Examples of more important challenges emphasized by the respondents include both other listed challenges, i.e., robustness and requirements specification, and other challenges, e.g., uncertainty of sensor data (in automotive) and the knowledge gap between industry and regulatory bodies (in the process industry). Three respondents answered in general terms that the main challenge of ML-based systems is the intrinsic non-determinism.

On the other hand, the agreeing respondents motivate that state-space explosion is indeed the most pressing challenge due to the huge input space of the operational environment (both in automotive and railway applications). One automotive researcher stresses that the state-space explosion impedes rigid testing but raises the transparency challenge as well – a lack thereof greatly limits analyzability, which is a key requirement for safety-critical systems. One automotive developer argues that the bigger the state-space of the input domain, the bigger the attack surface becomes – possibly referring to both adversarial attacks and other antagonistic cyber attacks. Finally, two respondents provide answers that encourage us to continue work along two paths in the SMILE project: 1) a tester in the railway domain explains that traceability during root cause analyses in ML applications will be critical, in line with our argumentation at a recent traceability conference15, and 2) one automotive architect argues that the state-space explosion will not be the main challenge, as any autonomous driving will have to be within “guard rails”, i.e., a solution similar to the safety cage architectures we intend to develop in the next phase of the project.

Seven respondents complemented the survey answers with concluding thoughts in Q18. One experienced manager in the aerospace domain explained: “What is now called ML was called neural nets (but less sophisticated) 30 years ago.”, a statement that supports our recommendation that the automotive industry should aim for cross-domain knowledge transfer regarding V&V of ML-based systems. The manager continued: “it (ML) introduces a new element in safety engineering. Or at least it moves the emphasis to more resilience. If the classifier is wrong, then it becomes a hazard and the system must be prepared for it.” We agree with the respondent that the actions needed in hazardous situations must be well-specified. Two respondents comment that conservatism is fundamental in functional safety; one of them elaborates that the “end of predictability” introduced by ML is a disruptive change that requires a paradigm shift.

5. Revisiting the RQs

This section first discusses the RQs in a larger context, and then aggregates the four sources of evidence presented in Fig. 2. Finally, we discuss implications for research and practice, including automotive manufacturers and regulatory bodies, and conclude by reporting the main threats to validity. Table 4 summarizes our findings.


5.1. RQ1: State-of-the-art in V&V of safety-critical ML

There is no doubt that deep learning research currently has incredible momentum. New applications and success stories are reported every month – and many applications come from the automotive domain. The rapid movement of the field is reflected by the many papers our study has identified on preprint archives, in particular the arXiv.org e-Print archive. It is evident that researchers are eager to claim novelty, and thus strive to publish results as fast as possible.

While DNNs have enabled amazing breakthroughs, there is much less published work on engineering safety for DNNs. On the other hand, we observe a growing interest, as several researchers call for more research on DNN safety, as well as on ML safety in general. However, there is no agreement on how to best develop safety-critical DNNs, and several different approaches have been proposed. Contemporary research endeavors often address the opaqueness of DNNs, to support analyzability and interpretability of systems with DNN components.

Deep learning research is in its infancy, and the tangible pioneering spirit sometimes brings the Wild West to mind. Anything goes, and there is a potential for great academic recognition for groundbreaking papers. There is certainly more fame in showcasing impressive applications than in updating engineering practices and processes.

Safety engineering stands as a stark contrast to the pioneering spirit. Quite the opposite: safety is permeated by conservatism. When a safety standard is developed, it captures the best available practices to engineer safe systems. This approach inevitably results in standards that lag behind the research front – safety first! In the automotive domain, ISO 26262 was developed long before DNNs for vehicles were an issue. Without question, DNNs constitute a paradigm shift in how to approach functional safety certification for automotive software, and we do not believe in any quick fixes to patch ISO 26262 for this new era. As recognized by researchers before us, e.g., Salay et al.78, there is a considerable gap between ML and ISO 26262 – a gap that probably needs to be bridged by new standards rather than incremental updates of previous work.

Broadening the discussion from DNNs to ML in general, our systematic snowballing of previous research on safety-critical ML shows a peak of aerospace research between 2002-2007 and automotive research dominating from 2013 and onwards. We notice that the aerospace domain allocated significant resources to research on neural networks for adaptive flight controllers roughly a decade before DNNs became popular in automotive research. We hypothesize that considerable knowledge transfer between the domains is possible now, and plan to pursue such work in the near future.

The academic literature on challenges in ML-based safety engineering has most frequently addressed state-space explosion and robustness (see Table 1 for definitions). On the other hand, the most commonly proposed solutions to overcome challenges of ML-based safety engineering are approaches that belong to formal methods, control theory, or probabilistic methods – but these appear only moderately promising to industry practitioners, who would rather see research on simulated test cases. As discussed in relation to RQ2, academia and industry share a common view on what challenges are important, but the level of agreement on the best way forward appears to be less clear.

5.2. RQ2: Main challenges for safe automotive DNNs

Industry practice is far from certifying DNNs for use in driverless safety-critical applications on public roads. Both the workshop series and the survey show that industry practitioners across organizations do not know how to tackle the challenge of approaching regulatory bodies and certification agencies with DNN-based systems. Most likely, both automotive manufacturers and safety standards need to largely adapt to fit the new ML paradigm – the current gap appears not to be bridgeable in the foreseeable future through incremental science alone.

On the other hand, although the current safety standards do not encompass ML yet, several automotive manufacturers are highly active in engineering autonomous vehicles. Tesla has received significant media coverage through pioneering demonstrations and self-confident statements. Volvo Cars is also highly active through the Drive Me initiative, and has announced a long-lasting partnership with Uber toward autonomous taxis.

Several other partnerships have recently been announced among automotive manufacturers, chipmakers, and ML-intensive companies. For example, Nvidia has partnered with Uber, Volkswagen, and Audi to support engineering self-driving cars using their GPU computing technology for ML development. Nvidia has also partnered with the Internet company Baidu, a company that has a highly competitive ML research group. Similarly, the chipmaker Intel has partnered with Fiat Chrysler Automobiles and the BMW Group to develop autonomy around their Mobileye solution. Moreover, large players such as Google, Apple, Ford, and Bosch are active in the area, as well as startups such as nuTonomy and FiveAI – no one wants to miss the boat to the lucrative future.

While there are impressive achievements from spearheading research, and some features are already available on the consumer market, they all have in common that the safety case argumentation relies on a human-in-the-loop. If a critical situation occurs, the human driver is expected to be present and take control of the vehicle. There are joint initiatives to formulate regulations for autonomous vehicles, but, analogously, there is a need for initiatives paving the way for new standards addressing the functional safety of systems that rely on ML and DNNs.

We elicited the most pressing issues concerning engineering of DNN-based systems through a workshop series and a survey with practitioners. Many discussions during the workshops were dominated by the robustness of DNN components, including detailed considerations about robust DNN architectures and the requirements on training data to learn a robust DNN model. Also the survey shows the importance of ML robustness, which motivates the attention it has received in academic publications (cf. RQ1). On the other hand, while there is agreement on the importance of ML robustness between academia and industry, how to tackle the phenomenon is still an open question – and thus a potential avenue for future research. Nonetheless, the problem of training a robust DNN component corresponding to the complexity of public traffic conforms with several of the “enduring problems” highlighted by the US Air Force in their technical report on V&V of autonomous systems [P1], e.g., state-space explosion and unpredictable environments.

While robustness is stressed by practitioners, academic publications have instead to a larger extent highlighted challenges related to the limited transparency of ML-based systems (e.g., Bhattacharyya et al.10) and the inevitable state-space explosion. The survey respondents confirm these challenges as well, but we recommend future studies to meet the expectations from industry regarding robustness research. Note, however, that the concept of robustness might have different interpretations despite having a formal IEEE definition42. Consequently, we call for an empirical study to capture what industry means by ML and DNN robustness in the automotive context.

The workshop participants perceived two possible approaches to pave the way for safety-critical DNNs as especially promising. First, continuous monitoring of DNN input using a safety cage architecture, a concept that has been proposed, for example, by Adler et al.1. Monitoring safe operation, and re-directing execution to a “safe track” without DNN involvement when uncertainties grow too large, is an example of the safety strategy safe fail101. Another approach to engineering ML safety, considered promising by the workshop participants and the survey respondents alike, is to simulate test cases.
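A minimal sketch of the safety cage idea follows, assuming a DNN-based policy, a conventional fallback controller, and a novelty score in [0, 1]; all names and the threshold are hypothetical.

```python
class SafetyCage:
    """Wrap a DNN-based function so that execution is re-directed to
    a conventional 'safe track' whenever the input looks unfamiliar."""

    def __init__(self, dnn_policy, safe_track_policy, novelty_score,
                 novelty_threshold=0.8):
        self.dnn_policy = dnn_policy                  # e.g., learned steering
        self.safe_track_policy = safe_track_policy    # e.g., controlled stop
        self.novelty_score = novelty_score            # maps input to [0, 1]
        self.novelty_threshold = novelty_threshold

    def act(self, sensor_input):
        # Only trust the DNN on inputs close to its training
        # distribution; otherwise fail safely without DNN involvement.
        if self.novelty_score(sensor_input) > self.novelty_threshold:
            return self.safe_track_policy(sensor_input)
        return self.dnn_policy(sensor_input)
```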

6. Conclusion and future work

Deep learning Neural Networks (DNN) are key to enable the vehicular perception required for autonomous driving. However, the behavior of DNN components cannot be guaranteed by traditional software and system engineering approaches. On top of that, crucial parts of the automotive safety standard ISO 26262 are not well-defined for certifying autonomous systems78,39 – certain process requirements contravene the nature of developing Machine Learning (ML)-based systems, especially in relation to Verification and Validation (V&V).


Table 4: Condensed findings in relation to the research questions, and implications for research and practice.

RQ1. What is the state-of-the-art in V&V of ML-based safety-critical systems?

• Most ML research showcases applications, while development on ML V&V is lagging behind.
• Considerable gap between the V&V mandated by safety standards and the nature of contemporary ML-based systems.
• The aerospace domain has collected experiences from V&V of adaptive flight controllers based on neural networks.
• Support for V&V of ML-based systems can be organized into: 1) formal methods, 2) control theory, 3) probabilistic methods, 4) process guidelines, and 5) simulated test cases.
• Academia has focused mostly on 1)–3), whereas industry perceives 5) as the most promising.

RQ2. What are the main challenges when engineering safety-critical systems with DNN components in the automotive domain?

• How to certify safety-critical systems with DNNs for use on public roads is unclear.
• Industry stresses robustness, whereas academia most often addresses state-space explosion and the lack of ML transparency.
• The challenges elicited corroborate work on V&V by NASA and USAF, covering neural networks, autonomous systems, and adaptive systems.

Implications for research and practice

• The gap between ML practice and ISO 26262 requires novel standards rather than incremental updates.
• Cross-domain knowledge transfer from aerospace V&V engineers to the automotive domain appears promising.
• Need for empirical studies to clarify what robustness means in the context of DNN-based autonomous vehicles.
• Systems-based safety approaches encouraged by industry, including safety cage architectures and simulated test cases.



Roughly a decade ago, using Artificial Neural Networks (ANN) in flight controllers was an active research topic, as was how to adhere to the strict aerospace safety standards. Now, with the advent of autonomous driving, we recommend the automotive industry to learn from guidelines73 and lessons learned96 from V&V of ANN-based components developed to conform with the DO-178B software safety standard for airborne systems. In particular, automotive software developers need to evolve practices for configuration management and architecture specifications to encompass fundamental DNN design elements. Also, requirements specifications and the corresponding software testing must be augmented to address the adaptive behavior of DNNs. Finally, the highly iterative development lifecycle of DNNs should be aligned with the traditional automotive V-model for systems development. A recent NASA report on safety certification of adaptive aerospace systems10 confirms the challenges of requirements specification and software testing. Moreover, related to ML, the report adds the lack of documentation and traceability in many open source libraries, and the issue of an expertise gap between regulators and engineers – conventional source code in C/C++ is very different from an opaque ML model trained on a massive dataset.

The work most similar to ours also originated in the aerospace domain, i.e., a project initiated by the US Air Force to describe enduring problems (and future possibilities) in relation to safety certification of autonomous systems [P1]. The project highlighted four primary challenges: 1) state-space explosion, 2) unpredictable environments, 3) emergent behavior, and 4) human-machine communication. While not explicitly discussing ML, the first two findings match the most pressing needs elicited in our work, i.e., state-space explosion as stressed by the academic literature (in combination with limited transparency) and robustness as emphasized by the workshop participants as well as the survey respondents (referred to as unpredictable environments in [P1]).

After having reviewed the state-of-the-art and the state-of-practice, the SMILE project will now embark on a solution-oriented journey. Based on the workshops, and motivated by the survey respondents, we conclude that pursuing a solution based on safety cage architectures37,1 encompassing DNN components is a promising direction. Our rationale is three-fold. First, the results from the workshops with automotive experts from industry clearly motivate us, i.e., the participants strongly encouraged us to explore such a solution as the next step. Second, we believe it would be feasible to develop a safety case around a safety cage architecture, since the automotive industry already uses the concept in physical vehicles. Third, we believe the DNN technology is ready to provide what is needed in terms of novelty detection. The safety cage architecture we envision will continuously monitor input data from the operational environment to re-direct execution to a non-ML safe track when uncertainties grow too large. Consequently, we advocate DNN safety strategies using a systems-based approach rather than techniques that focus on the internals of DNNs. Finally, also motivated by both the workshops and the survey respondents, we propose an approach to V&V that makes heavy use of simulation – in line with previous recommendations by other researchers9,14,99.
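One established way to obtain the novelty detection mentioned above (cf. the review by Pimentel et al.70) is to flag inputs that an autoencoder, trained on data from the intended operational environment, reconstructs poorly. A minimal sketch, assuming a pre-trained autoencoder function:

```python
import numpy as np

def make_novelty_score(autoencoder, calibration_inputs):
    """Build a novelty score in [0, 1] from reconstruction error:
    inputs that the autoencoder (assumed pre-trained on data from the
    intended operational environment) reconstructs poorly are likely
    far from the training distribution."""
    errors = [np.mean((x - autoencoder(x)) ** 2) for x in calibration_inputs]
    # A high percentile of in-distribution errors sets the scale.
    scale = max(float(np.percentile(errors, 99)), 1e-12)

    def score(x):
        err = float(np.mean((x - autoencoder(x)) ** 2))
        return min(err / scale, 1.0)

    return score
```

Such a score could serve as the gating function of the safety cage sketched in Section 5.2.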

Future work will also study how transfer learning could be used to incorporate training data from different contexts or manufacturers, or even include synthetic data from simulators, into DNNs for real-world automotive perception. So far we have mostly limited the discussion to fixed DNN-based systems, i.e., systems trained only prior to deployment. An obvious direction for future work is to explore how dynamic DNNs would influence our findings, i.e., DNNs that adapt by continued learning either in batches or through online learning. Furthermore, research on V&V of ML-based systems is more complex than pure technology in isolation. Thus, we recognize the need to explore both the ethical and legal aspects involved in safety certification of ML-based systems. Finally, there is a new automotive standard under development that will address autonomous safety: ISO/PAS 21448 Road vehicles – Safety of the intended functionality. We are not aware of its contents at the time of this writing, but once published, we will use it as an important reference point for our future solution proposals.
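As an illustration of the transfer learning direction mentioned above, a common recipe (cf. Oquab et al.69) is to reuse a feature extractor pretrained in a source context and fine-tune only a new task head on target-context data. A minimal sketch, assuming PyTorch and a recent torchvision; the four-class obstacle head is hypothetical:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load a backbone pretrained in a source context (here: ImageNet);
# the same recipe applies to a model pretrained on another
# manufacturer's data or on synthetic simulator imagery.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the transferred feature extractor...
for param in backbone.parameters():
    param.requires_grad = False

# ...and replace the task head for the target perception task,
# here a hypothetical four-class obstacle classifier.
backbone.fc = nn.Linear(backbone.fc.in_features, 4)

# Only the new head is trained on the (scarce) target-context data.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
```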

Acknowledgments

Thanks go to all participants in the SMILE workshops, in particular Carl Zanden, Michal Simoen, and Konstantin Lindstrom. This work was carried out within the SMILE and SMILE II projects financed by Vinnova, FFI (Fordonsstrategisk forskning och innovation) under the grant numbers 2016-04255 and 2017-03066.

References

1. R. Adler, P. Feth, and D. Schneider. Safety Engineering for Autonomous Vehicles. In Proc. of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, pages 200–205, 2016.

2. Ankur Agrawal, Chia-Yu Chen, Jungwook Choi, Kailash Gopalakrishnan, Jinwook Oh, Sunil Shukla, Viji Srinivasan, Swagath Venkataramani, and Wei Zhang. Accelerator Design for Deep Learning Training: Extended Abstract: Invited. In Proc. of the 54th Annual Design Automation Conference, pages 57:1–57:2, 2017.

3. A. K. Akametalu, J. F. Fisac, J. H. Gillula, S. Kaynama, M. N. Zeilinger, and C. J. Tomlin. Reachability-based safe learning with Gaussian processes. In Proc. of the 53rd IEEE Conference on Decision and Control, pages 1424–1431, 2014.

4. Rob Alexander, Heather Rebecca Hawkins, and Andrew John Rae. Situation coverage - A coverage criterion for testing autonomous robots. Technical report, University of York, 2015.

5. Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mane. Concrete Problems in AI Safety. arXiv:1606.06565, 2016.

6. P. J. Angeline, G. M. Saunders, and J. B. Pollack. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks, 5(1):54–65, 1994.

7. James Arnold and Rob Alexander. Testing Autonomous Robot Control Software Using Procedural Content Generation. In Computer Safety, Reliability, and Security, pages 33–44. Springer, Berlin, 2013.

8. Nicholas J. Bahr. System Safety Engineering and Risk Assessment: A Practical Approach, Second Edition. CRC Press, 2014.

9. Raja Ben Abdessalem, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. Testing advanced driver assistance systems using multi-objective search and neural networks. In Proc. of the 31st International Conference on Automated Software Engineering, pages 63–74. ACM, 2016.

10. Siddhartha Bhattacharyya, Darren Cofer, D. Musliner, Joseph Mueller, and Eric Engstrom. Certification considerations for adaptive systems. Technical Report CR-2015-218702, NASA, 2015.

11. R. Billinton and R. Allan, editors. Reliability evaluation of engineering systems: concepts and techniques. Springer Science & Business Media, 2013.

12. E. Bjarnason, P. Runeson, M. Borg, M. Unterkalmsteiner, E. Engstrom, B. Regnell, G. Sabaliauskaite, A. Loconsole, T. Gorschek, and R. Feldt. Challenges and practices in aligning requirements with verification and validation: a case study of six companies. Empirical Software Engineering, 19(6):1809–1855, 2014.

13. Mariusz Bojarski, A. Choromanska, K. Choromanski, B. Firner, L. Jackel, U. Muller, and K. Zieba. VisualBackProp: efficient visualization of CNNs. arXiv:1611.05418, 2016.

14. Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. End to End Learning for Self-Driving Cars. arXiv:1604.07316, 2016.

15. M. Borg, C. Englund, and B. Duran. Traceability and Deep Learning - Safety-critical Systems with Traces Ending in Deep Neural Networks. In Proc. of the Grand Challenges of Traceability: The Next Ten Years, pages 48–49, 2017.

16. J. Bosworth and P. Williams-Hayes. Flight Test Results from the NF-15B Intelligent Flight Control System Project with Adaptation to a Simulated Stabilator Failure. Technical report, NASA TM-2007-214629, 2007.

17. F. V. Brasileiro, P. D. Ezhilchelvan, S. K. Shrivastava, N. A. Speirs, and S. Tao. Implementing fail-silent nodes for distributed systems. IEEE Transactions on Computers, 45(11):1226–1238, 1996.

18. G. Brat and A. Jonsson. Challenges in verification and validation of autonomous systems for space exploration. In Proc. of the IEEE International Joint Conference on Neural Networks, volume 5, pages 2909–2914, 2005.

19. R. L. Broderick. Statistical and adaptive approach for verification of a neural-based flight control system. In Proc. of the 23rd Digital Avionics Systems Conference, volume 2, pages 6.E.1–61–10, 2004.

20. R. L. Broderick. Adaptive verification for an on-line learning neural-based flight control system. In Proc. of the 24th Digital Avionics Systems Conference, volume 1, pages 6.C.2–61–10, 2005.

21. A. Broggi, M. Buzzoni, S. Debattisti, P. Grisleri, M. C. Laghi, P. Medici, and P. Versari. Extensive Tests of Autonomous Driving Technologies. IEEE Transactions on Intelligent Transportation Systems, 14(3):1403–1415, 2013.

Table 5: The start set and the four subsequent iterations of the snowballing literature review.

Start set: [P1] M. Clark et al.24, [P2] D. Amodei et al.5, [P3] G. Brat and A. Jonsson18, [P4] A. Broggi et al.21, [P5] B. Taylor et al.97, [P6] J. Taylor et al.98, [P7] A. Carvalho et al.23, [P8] S. Ramos et al.75, [P9] R. Alexander et al.4, [P10] X. Zou et al.112, [P11] X. Zou et al.113, [P12] J. Arnold and R. Alexander7, [P13] S. Sivaraman and M. Trivedi89, [P14] A. Mozaffari et al.65

Iteration 1: [P15] S. Seshia et al.86, [P16] P. Helle et al.38, [P17] L. Li et al.57, [P18] W. Shi et al.87, [P19] K. Sullivan et al.94, [P20] R. Broderick20, [P21] N. Li et al.58, [P22] S. Russell et al.77, [P23] A. Broggi et al.22, [P24] J. Schumann and S. Nelson84, [P25] J. Hull et al.41, [P26] L. Pulina and A. Tacchella71, [P27] S. Lefevre et al.55

Iteration 2: [P28] X. Huang et al.40, [P29] K. Sullivan et al.94, [P30] P. Gupta and J. Schumann32, [P31] J. Schumann et al.81, [P32] R. Broderick19, [P33] Y. Liu et al.59, [P34] S. Yerramalla et al.106, [P35] R. Zakrzewski109, [P36] S. Yerramalla et al.106, [P37] G. Katz et al.50, [P38] A. Akametalu et al.3, [P39] S. Seshia et al.85, [P40] A. Mili et al.63, [P41] Z. Kurd et al.53, [P42] L. Pulina and A. Tacchella72, [P43] J. Schumann et al.84, [P44] R. Zakrzewski107, [P45] D. Mackall et al.60

Iteration 3: [P46] S. Jacklin et al.48, [P47] N. Nguyen and S. Jacklin67, [P48] J. Schumann and Y. Liu82, [P49] N. Nguyen and S. Jacklin68, [P50] S. Jacklin et al.49, [P51] J. Taylor et al.98, [P52] G. Li et al.56, [P53] P. Gupta et al.33, [P54] K. Scheibler et al.80, [P55] P. Gupta et al.31, [P56] S. Jacklin et al.47, [P57] V. Cortellessa et al.25, [P58] S. Yerramalla et al.105

Iteration 4: [P58] S. Jacklin46, [P59] F. Soares et al.92, [P60] X. Zhang et al.110, [P61] F. Soares and J. Burken91, [P62] C. Torens et al.100, [P63] J. Bosworth and P. Williams-Hayes16, [P64] R. Zakrzewski108

22. A. Broggi, P. Cerri, P. Medici, P. P. Porta, and G. Ghisio. Real Time Road Signs Recognition. In 2007 IEEE Intelligent Vehicles Symposium, pages 981–986, 2007.

23. A. Carvalho, S. Lefevre, G. Schildbach, J. Kong, and F. Borelli. Automated driving: The role of forecasts and uncertainty - A control perspective. European Journal of Control, 2015.

24. Matthew Clark, Kris Kearns, Jim Overholt, Kerianne Gross, Bart Barthelemy, and Cheryl Reed. Air Force Research Laboratory Test and Evaluation, Verification and Validation of Autonomous Systems Challenge Exploration. Technical report, Air Force Research Lab Wright-Patterson, 2014.

25. Vittorio Cortellessa, Bojan Cukic, Diego Del Gobbo, Ali Mili, Marcello Napolitano, Mark Shereshevsky, and Harjinder Sandhu. Certifying Adaptive Flight Control Software. In Proc. of the Software Risk Management Conf., 2000.

26. D. S. Cruzes and T. Dyba. Recommended Steps for Thematic Synthesis in Software Engineering. In Proc. of the International Symposium on Empirical Software Engineering and Measurement, pages 275–284, 2011.

27. Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In PMLR, pages 647–655, 2014.

28. F. Falcini, G. Lami, and A. Costanza. Deep Learning in Automotive Software. IEEE Software, 34(3):56–63, 2017.

29. Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain Adaptation for Large-scale Sentiment Classification: A Deep Learning Approach. In Proc. of the 28th International Conference on Machine Learning, pages 513–520, 2011.

30. Ian Goodfellow. NIPS 2016 Tutorial: Generative Adversarial Networks. arXiv:1701.00160, 2016.

31. P. Gupta, K. A. Loparo, D. Mackall, J. Schumann, and F. R. Soares. Verification and Validation Methodology of Real-time Adaptive Neural Networks for Aerospace Applications. Technical report, NASA, 2004.

32. P. Gupta and J. Schumann. A tool for verification and validation of neural network based adaptive controllers for high assurance systems. In Proc. of the 8th IEEE International Symposium on High Assurance Systems Engineering, pages 277–278, 2004.


33. Pramod Gupta, Kurt Guenther, John Hodgkinson, Stephen Jacklin, Michael Richard, Johann Schumann, and Fola Soares. Performance Monitoring and Assessment of Neuro-Adaptive Controllers for Aerospace Applications Using a Bayesian Approach. In AIAA Guidance, Navigation, and Control Conference and Exhibit. American Institute of Aeronautics and Astronautics, 2005.

34. A. Gurghian, T. Koduri, S. Bailur, K. Carey, and V. Murali. DeepLanes: End-To-End Lane Position Estimation Using Deep Neural Networks. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 38–45, 2016.

35. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proc. of the International Conference on Computer Vision, 2015.

36. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

37. Karl Heckemann, Manuel Gesell, Thomas Pfister, Karsten Berns, Klaus Schneider, and Mario Trapp. Safe Automotive Software. In Knowledge-Based and Intelligent Information and Engineering Systems, pages 167–176. Springer, Berlin, Heidelberg, 2011.

38. P. Helle, W. Schamai, and C. Strobel. Testing of Autonomous Systems - Challenges and Current State-of-the-Art. In Proc. of the 26th Annual INCOSE International Symposium, pages 571–584, 2016.

39. Jens Henriksson, Markus Borg, and Cristofer Englund. Automotive safety and machine learning: Initial results from a study on how to adapt the ISO 26262 safety standard. In Proc. of the 1st International Workshop on Software Engineering for AI in Autonomous Systems, pages 47–49. IEEE, 2018.

40. Xiaowei Huang, Marta Kwiatkowska, Sen Wang, and Min Wu. Safety Verification of Deep Neural Networks. arXiv:1610.06940, 2016.

41. J. Hull, D. Ward, and R. R. Zakrzewski. Verification and validation of neural networks for safety-critical applications. In Proc. of the 2002 American Control Conference, volume 6, pages 4789–4794, 2002.

42. IEEE Computer Society. 610.12-1990 IEEE standard glossary of software engineering terminology. Technical report, 1990.

43. International Electrotechnical Commission. IEC 61508 ed 1.0, Electrical/electronic/programmable electronic safety-related systems. 2010.

44. International Organization for Standardization. ISO 26262 Road vehicles - Functional safety. 2011.

45. Martin Ivarsson and Tony Gorschek. A method for evaluating rigor and industrial relevance of technology evaluations. Empirical Software Engineering, 16(3):365–395, 2011.

46. Stephen Jacklin. Closing the Certification Gaps in Adaptive Flight Control Software. In AIAA Guidance, Navigation and Control Conference and Exhibit. American Institute of Aeronautics and Astronautics, 2008.

47. Stephen Jacklin, Johann Schumann, Pramod Gupta, M. Lowry, John Bosworth, Eddie Zavala, Kelly Hayhurst, Celeste Belcastro, and Christine Belcastro. Verification, Validation, and Certification Challenges for Adaptive Flight-Critical Control System Software. In AIAA Guidance, Navigation, and Control Conference and Exhibit, 2004.

48. Stephen Jacklin, Johann Schumann, Pramod Gupta, Michael Richard, Kurt Guenther, and Fola Soares. Development of Advanced Verification and Validation Procedures and Tools for the Certification of Learning Systems in Aerospace Applications. In Infotech@Aerospace. American Institute of Aeronautics and Astronautics, 2005.

49. Stephen A. Jacklin, Johann Schumann, John T. Bosworth, Peggy S. Williams-Hayes, and Richard S. Larson. Case Study: Test Results of a Tool and Method for In-Flight, Adaptive Control System Verification on a NASA F-15 Flight Research Aircraft. In Proc. of the 7th World Congress on Computational Mechanics, Minisymposium: Accomplishments and Challenges in Verification and Validation, 2006.

50. Guy Katz, Clark Barrett, David Dill, Kyle Julian, and Mykel Kochenderfer. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. arXiv:1702.01135, 2017.

51. Alessia Knauss, Jan Schroeder, Christian Berger, and Henrik Eriksson. Software-related Challenges of Testing Automated Vehicles. In Proc. of the 39th International Conference on Software Engineering Companion, pages 328–330, 2017.

52. Matthias Korte, Frederic Holzmann, Gerd Kaiser, Volker Scheuch, and Hubert Roth. Design of a Robust Plausibility Check for an Adaptive Vehicle Observer in an Electric Vehicle. In Advanced Microsystems for Automotive Applications 2012, pages 109–119. Springer, Berlin, Heidelberg, 2012.

53. Zeshan Kurd, Tim Kelly, and Jim Austin. Developing artificial neural networks for safety critical systems. Neural Computing and Applications, 16(1):11–19, 2007.

54. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

55. Stephanie Lefevre, Dizan Vasquez, and Christian Laugier. A survey on motion prediction and risk assessment for intelligent vehicles. ROBOMECH Journal, 1(1):1–14, 2014.

56. G. Li, M. Lu, and B. Liu. A Scenario-Based Method for Safety Certification of Artificial Intelligent Software. In Proc. of the 2010 International Conference on Artificial Intelligence and Computational Intelligence, volume 3, pages 481–483, 2010.

57. L. Li, W. L. Huang, Y. Liu, N. N. Zheng, and F. Y. Wang. Intelligence Testing for Autonomous Vehicles: A New Approach. IEEE Transactions on Intelligent Vehicles, 1(2):158–166, 2016.

58. Nan Li, Dave Oyler, Mengxuan Zhang, Yildiray Yildiz, Ilya Kolmanovsky, and Anouck Girard. Game-Theoretic Modeling of Driver and Vehicle Interactions for Verification and Validation of Autonomous Vehicle Control Systems. arXiv:1608.08589, 2016.

59. Yan Liu, Bojan Cukic, and Srikanth Gururajan. Validating neural network-based online adaptive systems: a case study. Software Quality Journal, 15(3):309–326, 2007.

60. D. Mackall, S. Nelson, and J. Schumann. Verification and validation of neural networks for aerospace systems. Technical report, 2002.

61. D. Maji, A. Santara, P. Mitra, and D. Sheet. Ensemble of Deep Convolutional Neural Networks for Learning to Detect Retinal Vessels in Fundus Images. arXiv:1603.04833, 2016.

62. E. Mhamdi, Rachid Guerraoui, and Sebastien Rouault. On The Robustness of a Neural Network. arXiv:1707.08167, 2017.

63. A. Mili, GuangJie Jiang, B. Cukic, Yan Liu, and R. B. Ayed. Towards the verification and validation of online learning systems: general framework and applications. In Proc. of the 37th Annual Hawaii International Conference on System Sciences, 2004.

64. Niklas Moller and Sven Ove Hansson. Principles of engineering safety: Risk and uncertainty reduction. Reliability Engineering & System Safety, 93(6):798–805, 2008.

65. Ahmad Mozaffari, Mahyar Vajedi, and Nasser L. Azad. A robust safety-oriented autonomous cruise control scheme for electric vehicles based on model predictive control and online sequential extreme learning machine with a hyper-level fault tolerance-based supervisor. Neurocomputing, 151:845–856, 2015.

66. National Instruments. What is the ISO 26262 Functional Safety Standard?

67. Nhan T. Nguyen and Stephen A. Jacklin. Neural Net Adaptive Flight Control Stability, Verification and Validation Challenges, and Future Research. In IJCNN Conference, 2007.

68. Nhan T. Nguyen and Stephen A. Jacklin. Stability, Convergence, and Verification and Validation Challenges of Neural Net Adaptive Flight Control. In Johann Schumann and Yan Liu, editors, Applications of Neural Networks in High Assurance Systems, pages 77–110. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.

69. Maxime Oquab, Leon Bottou, Ivan Laptev, and Josef Sivic. Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724, 2014.

70. Marco A. F. Pimentel, David A. Clifton, Lei Clifton, and Lionel Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.

71. Luca Pulina and Armando Tacchella. An Abstraction-Refinement Approach to Verification of Artificial Neural Networks. In Computer Aided Verification, pages 243–257. Springer, Berlin, Heidelberg, 2010.

72. Luca Pulina and Armando Tacchella. NeVer: a tool for artificial neural networks verification. Annals of Mathematics and Artificial Intelligence, 62(3):403–425, 2011.

73. Laura Pullum, Brian Taylor, and Majorie Darrah. Guidance for the Verification and Validation of Neural Networks. John Wiley & Sons, Inc., 2007.

74. Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434, 2015.

75. Sebastian Ramos, Stefan Gehrig, Peter Pinggera, Uwe Franke, and Carsten Rother. Detecting Unexpected Obstacles for Self-Driving Cars: Fusing Deep Learning and Geometric Modeling. arXiv:1612.06573, 2016.

76. Louis M. Rea and Richard A. Parker. Designing and Conducting Survey Research: A Comprehensive Guide. John Wiley & Sons, 4th edition, 2014.

77. S. Russell, D. Dewey, and M. Tegmark. Research Priorities for Robust and Beneficial Artificial Intelligence. arXiv:1602.03506, 2016.

78. Rick Salay, Rodrigo Queiroz, and Krzysztof Czarnecki. An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software. arXiv:1709.02435, 2017.

79. Ahmad Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. End-to-End Deep Reinforcement Learning for Lane Keeping Assist. arXiv:1612.04340, 2016.

80. Karsten Scheibler, Leonore Winterer, Ralf Wimmer, and Bernd Becker. Towards Verification of Artificial Neural Networks. In Proc. of MBMV, 2015.

81. J. Schumann, P. Gupta, and S. Jacklin. Toward Verification and Validation of Adaptive Aircraft Controllers. In Proc. of the IEEE Aerospace Conference, pages 1–6, 2005.

82. J. Schumann and Y. Liu. Tools and Methods for the Verification and Validation of Adaptive Aircraft Control Systems. In Proc. of the IEEE Aerospace Conference, pages 1–8, 2007.

83. Johann Schumann, Pramod Gupta, and Yan Liu. Application of Neural Networks in High Assurance Systems: A Survey. In Applications of Neural Networks in High Assurance Systems, Studies in Computational Intelligence, pages 1–19. Springer, Berlin, Heidelberg, 2010.

84. Johann Schumann and Stacy Nelson. Toward V&V of Neural Network Based Controllers. In Proc. of the 1st Workshop on Self-healing Systems, pages 67–72, 2002.

85. Sanjit A. Seshia, Dorsa Sadigh, and S. Shankar Sastry. Formal Methods for Semi-autonomous Driving. In Proc. of the 52nd Annual Design Automation Conference, pages 148:1–148:5, 2015.

86. Sanjit A. Seshia, Dorsa Sadigh, and S. Shankar Sastry. Towards Verified Artificial Intelligence. arXiv:1606.08514, 2016.

87. Weijing Shi, Mohamed Baker Alawieh, Xin Li, Huafeng Yu, Nikos Arechiga, and Nobuyuki Tomatsu. Efficient Statistical Validation of Machine Learning Systems for Autonomous Driving. In Proc. of the 35th International Conference on Computer-Aided Design, pages 36:1–36:8, 2016.

88. Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556, 2014.

89. Sayanan Sivaraman and Mohan M. Trivedi. Active Learning for On-road Vehicle Detection: A Comparative Study. Machine Vision and Applications, 25(3):599–611, 2014.

90. Leon Sixt, Benjamin Wild, and Tim Landgraf. RenderGAN: Generating Realistic Labeled Data. arXiv:1611.01331, 2016.

91. F. Soares and J. Burken. A Flight Test Demonstration of On-line Neural Network Applications in Advanced Aircraft Flight Control System. In Proc. of the 2006 International Conference on Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and International Commerce, pages 136–136, 2006.

92. Fola Soares, John Burken, and Tshilidzi Marwala. Neural Network Applications in Advanced Aircraft Flight Control System, a Hybrid System, a Flight Test Demonstration. In Irwin King, Jun Wang, Lai-Wan Chan, and DeLiang Wang, editors, Neural Information Processing, pages 684–691. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.

93. B. Spanfelner, D. Richter, S. Ebel, U. Wilhelm, W. Branz, and C. Patz. Challenges in applying the ISO 26262 for driver assistance. In Proc. of the Schwerpunkt Vernetzung, 5. Tagung Fahrerassistenz, Munich, Germany, 2012.

94. K. B. Sullivan, K. M. Feigh, F. T. Durso, U. Fischer, V. L. Pop, K. Mosier, J. Blosch, and D. Morrow. Using neural networks to assess human-automation interaction. In 2011 IEEE/AIAA 30th Digital Avionics Systems Conference, pages 6A4–1–6A4–10, 2011.

95. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going Deeper With Convolutions. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.

96. Brian J. Taylor. Methods and Procedures for the Verification and Validation of Artificial Neural Networks. Springer Science & Business Media, 2006.

97. Brian J. Taylor, Marjorie A. Darrah, and Christina D. Moats. Verification and validation of neural networks: a sampling of research in progress. In Proc. of SPIE 5103, Intelligent Computing: Theory and Applications, volume 5103, pages 8–16, 2003.

98. J. Taylor, E. Yudkowsky, P. LaVictoire, and A. Critch. Alignment for advanced machine learning systems. Technical report, Machine Intelligence Research Institute, 2016.

99. Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. DeepTest: Automated testing of deep-neural-network-driven autonomous cars. In Proc. of the 40th International Conference on Software Engineering, pages 303–314. ACM, 2018.

100. Christoph Torens, Florian-M. Adolf, and Lukas Goormann. Certification and Software Verification Considerations for Autonomous Unmanned Aircraft. Journal of Aerospace Information Systems, 11(10):649–664, 2014.

101. K. Varshney. Engineering safety in machine learning. In Proc. of the 2016 Information Theory and Applications Workshop, pages 1–5, January 2016.

102. K. Varshney, R. Prenger, T. Marlatt, B. Chen, and W. Hanley. Practical Ensemble Classification Error Bounds for Different Operating Points. IEEE Transactions on Knowledge and Data Engineering, 25(11):2590–2601, 2013.

103. C. Wohlin. Guidelines for Snowballing in Systematic Literature Studies and a Replication in Software Engineering. In Proc. of the 18th International Conference on Evaluation and Assessment in Software Engineering, pages 38:1–38:10, 2014.

104. X. Yao and Y. Liu. A new evolutionary system for evolving artificial neural networks. IEEE Transactions on Neural Networks, 8(3):694–713, 1997.

105. Sampath Yerramalla, Edgar Fuller, Martin Mladenovski, and Bojan Cukic. Lyapunov Analysis of Neural Network Stability in an Adaptive Flight Control System. In Proc. of the 6th International Conference on Self-stabilizing Systems, pages 77–92, 2003.

106. Sampath Yerramalla, Yan Liu, Edgar Fuller, Bojan Cukic, and Srikanth Gururajan. An Approach to V&V of Embedded Adaptive Systems. In Michael G. Hinchey, James L. Rash, Walter F. Truszkowski, and Christopher A. Rouff, editors, Formal Approaches to Agent-Based Systems, pages 173–188. Springer Berlin Heidelberg, Berlin, Heidelberg, 2005.

107. R. R. Zakrzewski. Verification of a trained neural network accuracy. In Proc. of the International Joint Conference on Neural Networks, volume 3, pages 1657–1662, 2001.

108. R. R. Zakrzewski. Verification of performance of a neural network estimator. In Proc. of the 2002 International Joint Conference on Neural Networks, volume 3, pages 2632–2637, 2002.

109. R. R. Zakrzewski. Randomized approach to verification of neural networks. In Proc. of the IEEE International Joint Conference on Neural Networks, volume 4, pages 2819–2824, 2004.

110. Xiaodong Zhang, Matthew Clark, Kudip Rattan, and Jonathan Muse. Controller Verification in Adaptive Learning Systems Towards Trusted Autonomy. In Proc. of the ACM/IEEE 6th International Conference on Cyber-Physical Systems, pages 31–40, 2015.

111. H. Zhu, K. Yuen, L. Mihaylova, and H. Leung. Overview of Environment Perception for Intelligent Vehicles. IEEE Transactions on Intelligent Transportation Systems, 18(10):2584–2601, 2017.

112. X. Zou, R. Alexander, and J. McDermid. On the Validation of a UAV Collision Avoidance System Developed by Model-Based Optimization: Challenges and a Tentative Partial Solution. In Proc. of the 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop, pages 192–199, 2016.

113. Xueyi Zou, Rob Alexander, and John McDermid. Safety Validation of Sense and Avoid Algorithms Using Simulation and Evolutionary Search. In Computer Safety, Reliability, and Security, pages 33–48. Springer, Cham, 2014.