MASTER THESIS IN ADVANCED ENGINEERING IN
PRODUCTION, LOGISTICS AND SUPPLY CHAIN MANAGEMENT
DEPLOYMENT OF MACHINE LEARNING
MODELS FOR PREDICTIVE QUALITY IN
PRODUCTION
AUTHOR: HENRIK HEYMANN
SUPERVISOR: ANDRES BOZA GARCIA
EXTERNAL SUPERVISOR: MAIK FRYE
Academic year: 2020-21
Abstract
Assuring production quality is a key element of manufacturing, especially in highly
developed countries. At the same time, machine learning is an emerging research topic.
An important area of application for machine learning is the prediction of quality
in production. In practice, deploying a performant model from the development stage into real-world
production often fails due to the lack of a clearly structured
methodology that covers the whole end-to-end process as well as the necessary decisions and
steps in detail.
This thesis aims to provide a methodology for machine learning model deployment applied to
the context of predictive quality based on data collected during the production process. To
facilitate predictive quality under consideration of the company’s specific needs and
restrictions, the methodology shall serve as a guideline during the selection process of the
most adequate deployment option.
In order to achieve this goal, a review of academic and gray literature identifies available
options and concepts for deployment. Based on the review, a methodology that analyzes
and structures the possible solutions is developed. For validation purposes, the methodology
is discussed with experts and a use case of a machine learning model from a real-world
manufacturing process is implemented.
The developed methodology provides a clear structure and gives an overview of the decisions
to be made and the tasks to be carried out for the deployment of machine learning models for predictive
quality in production. Further research could examine individual phases of the
methodology in greater depth, such as the implementation from a software engineering perspective.
Keywords: Machine Learning; Predictive Quality; Production Quality
Resumen
Assuring production quality is one of the key elements of manufacturing, especially in highly
developed countries. At the same time, machine learning is an emerging research topic. An
important area of application for machine learning is the prediction of quality in production. In
practice, the deployment of a high-performing model from the development stage into real-world
production often fails due to the lack of a clearly structured methodology that covers the whole
end-to-end process as well as the necessary decisions and steps in detail.
This master's thesis aims to provide a methodology for the deployment of machine learning
models applied to the context of quality prediction based on data collected during the production
process. To facilitate predictive quality while taking into account the company's specific needs
and restrictions, the methodology shall serve as a guideline during the selection process of the
most adequate deployment option.
To achieve this goal, a review of academic and gray literature identifies the available options
and concepts for deployment. Based on the review, a methodology that analyzes and structures
the possible solutions is developed. To validate the methodology, it is discussed with experts
and a use case of a machine learning model from a real-world manufacturing process is
implemented.
The developed methodology provides a clear structure and offers an overview of the decisions
and tasks required for the deployment of machine learning models for quality prediction in
production. Future research could go deeper into individual phases of the methodology, such as
the implementation with a software engineering focus.
Keywords: machine learning; quality prediction; production quality
I Table of Contents
I Table of Contents ......................................................................................................... i
II Abbreviations ...............................................................................................................iv
III List of Figures ...............................................................................................................v
IV List of Tables .............................................................................................................viii
1 Introduction ...................................................................................................................1
1.1 Initial Situation and Motivation .................................................................................1
1.2 Objective .................................................................................................................2
1.3 Structure .................................................................................................................2
2 Problem in Practice ......................................................................................................4
2.1 Current State of Deploying ML Models in Practice ..................................................4
2.2 Main Challenges for Deployment of ML Models in Practice .....................................4
2.3 Need for Investigation .............................................................................................6
3 Theoretical Fundamentals ...........................................................................................7
3.1 Quality Management ...............................................................................................7
3.1.1 Predictive Quality ............................................................................................7
3.1.2 Exemplary Use Cases in Practice ...................................................................8
3.2 Machine Learning (ML) ...........................................................................................9
3.2.1 Definition .........................................................................................................9
3.2.2 Life Cycle of ML Projects ...............................................................................11
3.2.3 Data Preparation and Modeling .....................................................................12
3.2.4 Evaluation and Deployment ...........................................................................16
3.3 Software Engineering ............................................................................................18
3.3.1 Traditional Software ......................................................................................18
3.3.2 ML Software ..................................................................................................21
4 State of the Art ............................................................................................................22
4.1 Definition of Evaluation Criteria .............................................................................22
4.2 Literature Review ..................................................................................................24
4.2.1 Step 1: Accumulation and Selection of Publications ......................................24
4.2.2 Step 2: Categorization of Publications ...........................................................28
4.2.3 Step 3: Evaluation of Publications .................................................................29
4.3 Most Relevant Approaches ...................................................................................33
4.4 Theory Deficit ........................................................................................................39
5 Outline of the Methodology........................................................................................40
5.1 Requirements........................................................................................................40
5.1.1 Content Requirements ..................................................................................40
5.1.2 Formal Requirements ....................................................................................40
5.2 Scope ...................................................................................................................41
5.3 Reference Framework ...........................................................................................42
6 Development of the Methodology..............................................................................44
6.1 Deployment Design ...............................................................................................45
6.1.1 Pre-considerations: Design Requirements.....................................................45
6.1.2 Architecture Patterns .....................................................................................49
6.2 Productionizing & Testing .....................................................................................52
6.2.1 Pre-considerations: Environments .................................................................52
6.2.2 Implementation Steps ....................................................................................53
6.3 Monitoring .............................................................................................................60
6.3.1 Pre-considerations: ML Model Decay ............................................................60
6.3.2 Monitoring Levels ..........................................................................................61
6.4 Retraining .............................................................................................................63
6.4.1 Pre-considerations: Retraining Effect ............................................................64
6.4.2 Retraining Decisions .....................................................................................64
6.5 General Aspects for Deployment...........................................................................65
6.5.1 Roles and Competencies ..............................................................................65
6.5.2 Tools and Frameworks ..................................................................................67
7 Verification and Validation .........................................................................................69
7.1 Verification ............................................................................................................69
7.2 Validation ..............................................................................................................70
7.2.1 Expert Interviews ...........................................................................................70
7.2.2 Practical Application ......................................................................................71
8 Conclusion ..................................................................................................................77
V Bibliography ................................................................................................................79
VI Budgeting ....................................................................................................................91
VII Appendix .....................................................................................................................94
II Abbreviations
Abbreviation Description
AI Artificial Intelligence
CD4ML Continuous Delivery for Machine Learning
CI/CD Continuous Integration/ Continuous Delivery
CPU Central Processing Unit
CRISP-DM Cross Industry Standard Process for Data Mining
GPU Graphics Processing Unit
IoT Internet of Things
IT Information Technology
KDD Knowledge Discovery in Databases
ML Machine Learning
OS Operating System
QA Quality Assurance
SEMMA Sample, Explore, Modify, Model and Assess
III List of Figures
Figure 1.1: Structure of the work based on applied research according to Ulrich ....................2
Figure 3.1: AI, ML, deep learning and data science based on Kotu and Deshpande (2019,
p. 3) ................................................................................................................................9
Figure 3.2: Traditional program and machine learning (Kotu & Deshpande, 2019, p. 3) .......10
Figure 3.3: CRISP-DM (Chapman et al., 2000, p. 10) ...........................................................12
Figure 3.4: Binary and multiclass classification, regression and clustering (Singh, 2021,
pp. 8–10) ......................................................................................................................14
Figure 3.5: Machine learning types according to en.proft.me (2015).....................................14
Figure 3.6: Visualization of reinforcement learning (Singh, 2021, pp. 12–13)........................15
Figure 3.7: Typical Steps in an ML Pipeline based on Galli (2020) .......................................18
Figure 3.8: Implementation steps to develop a large computer program (Royce, 1970) ........19
Figure 3.9: Software development life cycle according to bigwater.consulting ......................19
Figure 3.10: Scrum framework according to Scrum.org ........................................................20
Figure 3.11: DevOps approach according to Harlann (2017) ................................................20
Figure 3.12: Continuous integration, delivery and deployment according to Pennington
(2019) ...........................................................................................................................21
Figure 3.13: MLOps (Neal Analytics, 2020) ..........................................................................21
Figure 4.1: Connected papers to Sculley et al. (2015) from connectedpapers.com ..............27
Figure 4.2: Year distribution of selected publications (Total of 46) ........................................28
Figure 4.3: Type distribution of selected publications (Total of 46)........................................28
Figure 4.4: Illustration of the process chain (Krauß et al., 2020) ...........................................33
Figure 4.5: Predictive model-based quality inspection framework (J. Schmitt et al., 2020) ...34
Figure 4.6: ML Code as small fraction of ML systems (Sculley et al., 2015) .........................34
Figure 4.7: Traditional system and ML-based system testing and monitoring (Breck et al.,
2017) ............................................................................................................................35
Figure 4.8: Continuous delivery for ML end-to-end process ..................................................36
Figure 5.1: AutoML pipeline in the context of production based on Krauß ............................43
Figure 6.1: Overview of methodology ...................................................................................44
Figure 6.2: Morphological box for deployment design ...........................................................46
Figure 6.3: Cloud service levels based on Watts and Raza (2019) and Chen (2020) ...........49
Figure 6.4: Common architecture patterns in practice ...........................................................50
Figure 6.5: Environments for ML model development and ML software development...........53
Figure 6.6: Sequence of implementation steps .....................................................................53
Figure 6.7: GitHub vs GitLab workflow (GitLab, 2021) ..........................................................55
Figure 6.8: Procedural programming vs pipeline structure ....................................................56
Figure 6.9: Bare metal, virtual machines, and containers based on Kominos et al. (2017)....57
Figure 6.10: Testing pyramid ................................................................................................58
Figure 6.11: Degrees of Automation based on Chigira (2019) ..............................................60
Figure 6.12: Levels of monitoring..........................................................................................61
Figure 6.13: Analysis flow chart ............................................................................................63
Figure 6.14: Impact of refreshing on model quality based on Thomas and Mewald (2019) ...64
Figure 6.15: Retraining decisions .........................................................................................64
Figure 6.16: Collaboration between process, data science and DevOps competence ..........66
Figure 6.17: Maturity model dimensions based on Hornick (2018)........................................66
Figure 7.1: Webservice architecture for use case .................................................................72
Figure 7.2: Screenshot of home page of the webservice ......................................................72
Figure 7.3: Screenshot of prediction input ............................................................................73
Figure 7.4: Screenshot of prediction output ..........................................................................73
Figure 7.5: Screenshot of monitoring ....................................................................................74
IV List of Tables
Table 3.1: Definitions and terminology (Mohri et al., 2018, pp. 4–5) .....................................13
Table 3.2: Confusion matrix (Harrington, 2012, p. 144) ........................................................16
Table 3.3: Classification metrics with formula (Flach, 2012, pp. 53–61; Harrington, 2012,
p. 24; Sarkar et al., 2018, p. 12) ....................................................................................17
Table 4.1: Evaluation criteria ................................................................................................22
Table 4.2: Search strings ......................................................................................................25
Table 4.3: Evaluation of existing approaches .......................................................................30
Table 4.4: Four potential ML system architecture approaches ..............................................37
Table 6.1: Evaluation of architectures ...................................................................................52
Table 6.2: Tests according to Sato et al. (2019) ...................................................................58
Table 6.3: Pros and cons of open-source and closed-source tools (Matteson, 2018) ...........67
Table 7.1: Evaluation of developed methodology .................................................................70
1 Introduction
According to Andrew Ng, an adjunct professor of computer science at Stanford University
and one of the leading personalities in artificial intelligence (AI), there is still enormous
potential to be exploited in many economic sectors (Johnson, 2019):
“I think the next massive wave of value creation will be when you can get a
manufacturing company or agriculture devices company or a health care
company to develop dozens of AI solutions to help their businesses.”
In this introductory chapter, the initial situation and motivation for implementing these
solutions in manufacturing companies are presented. Furthermore, the objective is
formulated followed by the description of the structure of this thesis.
1.1 Initial Situation and Motivation
Machine learning (ML), as one subdomain of AI, is an emerging research topic
(Perrault et al., 2019). The main reasons for the growth in the adoption of ML and AI by
businesses are the rise in data, increased computational efficiency, improved ML algorithms,
and the availability of data scientists (Singh, 2021, p. 3). With regard to the manufacturing sector,
ML is a useful technique to predict quality in production. Companies from
industrialized nations must be able to manufacture products of superior quality at competitive
costs to ensure their competitiveness in a globalized world (National Research Council,
1995, p. 1). Initial applications show that the value of using ML models for quality
prediction has already been recognized (Brosset et al., 2019).
Singh (2021, p. 53) states that ML and AI “are not a silver bullet that can solve all problems”.
The author of this statement refers to the fact that the implementation of ML does not work
without significant investments, including the necessary effort associated with the
deployment. This very step of integrating the model into the running process is a crucial
factor for the success of an ML project, as a true benefit for a company is only generated by
making the predictions available to the appropriate users in production (Odegua, 2020).
Currently, transferring a performant ML model from the development stage into real-world
production is often conducted in an unstructured and unsubstantiated manner, which cannot
ensure that the best solution is delivered for every specific situation. This is caused by the huge
variety of different concepts available for deployment. Due to the field’s novelty and dynamism,
capturing the whole landscape of offered tools proves extremely difficult (Turck, 2020).
Complicating matters further, decision owners in the top management of companies tend to
lack in-depth knowledge of data science and related fields such as software engineering
(Salminen et al., 2017). Consequently, there is a lack of a structured and well-founded procedure for
deploying ML models that describes the deployment process in breadth and
depth and assists with the implementation of an ML solution from start to finish.
1.2 Objective
For these reasons, this thesis aims to provide a methodology for ML model deployment
applied to the context of predictive quality in production. Within this field of application, the
focus is set on ML models which ingest tabular data from manufacturing processes to make
inferences on the output quality. To facilitate predictive quality under consideration of the
company’s specific needs and restrictions, the methodology shall serve as a guideline during
the selection process of the most adequate deployment option.
Therefore, the following research question can be formulated based on the defined goal and
is to be answered in this thesis:
How to deploy ML models for predictive quality in production?
1.3 Structure
Structurally, this work is based on the concept for applied research in theory and practice by
Ulrich et al. (1984, p. 193). Figure 1.1 links the seven phases of the applied research
methodology with the eight resulting chapters of this work.
Figure 1.1: Structure of the work based on applied research according to Ulrich
In chapter 1, the relevance of the topic is set out, the objective is defined, and the
overarching research question is established. This introduction is followed by chapter 2 with
a more detailed description of the problem and its corresponding challenges encountered in
practice. These first two chapters serve to identify relevant problems from practice.
Problem-relevant theories are identified in chapter 3 and 4. On the one hand, chapter 3 lays
the theoretical foundation for the further course of this work by introducing fundamental
concepts of quality management, ML, and software engineering. Chapter 4, on the other
hand, evaluates existing approaches by means of an analysis of the state of the art.
Subsequently, the methodology is outlined in chapter 5 by defining the requirements, the
scope, and the reference framework. Both the state of the art and the outline of the methodology
serve to collect problem-relevant procedures from the formal sciences and to
capture the application context. After defining the specifications of the concept, the
subsequent chapter 6 is devoted to elaborating the methodology for deploying ML models for
predictive quality in production. Both chapters 5 and 6 comprise the derivation of assessment
criteria, design rules, and models for the methodology.
In chapter 7, the generated solution approach to answer the research question is verified and
validated including the practical implementation in order to test the model in the context of
application. Finally, chapter 8 summarizes and critically reflects on the results with regard to their
implications for consulting and implementation in practice before providing a short outlook on
further research.
2 Problem in Practice
This chapter adds more detail to the introductory motivation by analyzing the current state of
the deployment of ML models in production. Furthermore, the challenges companies face
when deploying ML models for practical applications
are addressed. In the following, ML and AI solutions are analyzed as one combined topic.
2.1 Current State of Deploying ML Models in Practice
Studies deliver evidence about the actual situation of deployments of AI applications in
companies. Underlying all surveys is the high level of interest in capitalizing on AI, which is
expected to change businesses fundamentally by contributing up to $15.7 trillion to the global
economy by the year 2030 (Rao et al., 2019).
Various studies on ML model deployment in practice conclude that
many ML projects never get deployed and only a very small percentage of ML models make
it to production (Gonfalonieri, 2019). Enterprises are discovering that it is easier to build a model
than to integrate it into existing processes, indicating that even if the model development
phase was successful, the most difficult part is yet to come. As a result of these so-called
last-mile deployment problems, most companies deploy only between 10 % and 40 % of their
ML projects, depending on their size and technology readiness (Lawton, 2020). Of all
pursued projects, 78 % are shut down before reaching the deployment stage (Singh Bisen,
2019). Further sources report that 87 % of data science projects never make it into
production (Larsen, 2019). Even without trying to find the most accurate percentage of
successfully deployed ML models, it becomes evident that deployment is executed
insufficiently in most cases.
Resources are wasted on unsuccessful deployment efforts, which include not deploying ML
models at all or failing to bring them properly into use. In either case, time and money are
spent without ever gaining any benefit from using the model. Focusing on surveyed companies
which did manage to deploy a model, about half of them say they spend between 8 and
90 days deploying one model. 18 % of companies take longer than 90 days, with some
spending more than a year productionizing. Translated into actual working time, data scientists
spend at least 25 % of their time deploying models (Algorithmia,
2019). As a takeaway, any company wanting to reap the potential of ML needs to prioritize
carrying out the deployment efficiently.
2.2 Main Challenges for Deployment of ML Models in Practice
The evidence shows that failed projects are not the exception but the usual case in
practice, so it is necessary to analyze the underlying causes of the problem. Reasons for
ineffective deployments can have an organizational and/or technological origin (Baier et al.,
2019). A selection of the most critical factors is explained in the following.
High Set-Up and Operation Effort
Setting up an ML system imposes technological challenges such as CPU and memory
usage, scalability, portability and traceability (Gonfalonieri, 2019; Shaik, 2019). Companies
need IT infrastructure that can maintain high availability in order to accommodate spikes in
demand for the ML model (Decosmo, 2019). All these factors, among many others, need to
be taken into account when selecting platforms and tools (Druzkowski, 2017). The range of
offered services comprises standardized programs by big established technology companies
as well as specialized tools by start-ups. Due to the dynamics of the market, the number of
different offerings will increase even more over time (Turck, 2020).
Setting up the system does not only concern technologies, but also the integration of ML
models into the business application (Shaik, 2019). The biggest AI deployment impediment
for most companies consists of providing the infrastructure for connecting AI to the
business. Only if AI is adopted all the way down to the end user does it unfold its real business
value (Lawton, 2020). Business users need to understand and trust ML models when using
predictions in their decision-making process; therefore, a model should be developed for one
specific task to enhance an existing process or solve a well-defined business problem. In
doing so, the idea is to keep it simple and not expect too much too quickly (Decosmo, 2019).
In comparison to the operation of regular software, ML applications require more frequent
updates and must be monitored continuously (Shaik, 2019). Monitoring includes
observing the ML model’s performance and watching out for gaps between training and real-
world data. An ML model’s accuracy is at its best only until it starts being used. It is hard to build
ML models that reflect future, unseen behavior if that behavior evolves quickly. A deployed
model interacts with the real world and, thus, changes in the real world can break the features
a model depends on or can make the prior distributions a model was trained on obsolete
(Talby, 2019).
Missing Coordination and Support
From a purely organizational point of view, the main challenge is the missing alignment
between roles. There might be a lack of understanding of the business problem by the
analyst, or the models may be too complex to implement (Shrivastava, 2016). Therefore, it is
necessary to build diverse teams that include people with business, IT and specialized
AI skills (Rao et al., 2019). A team should have both practical software development and
model-building experience, as many data scientists only have academic experience in
building ML models and lack practical experience in deploying them (Decosmo, 2019). In
order to leverage AI projects successfully, coordination within the team but also across
hierarchies needs to be considered. A project’s success depends on the leadership support
by business leaders and the communication with the decision owners (Larsen, 2019).
2.3 Need for Investigation
All the previously described challenges are already difficult enough, yet companies still tend
to underestimate the deployment and maintenance of ML solutions (Shaik, 2019). As an
interim conclusion, the challenging task of deploying ML models cannot be considered in
isolation, as there are many different relevant factors influencing the process. Not only
the technological requisites of the specific use case, but also organizational and structural
components within the company can have a substantial impact. Evidence from studies
shows the increasing relevance of the topic and reveals the existing deficit in practice, which
serves as a starting point for developing a methodology. The identified challenges are
addressed and picked up in the development phase.
3 Theoretical Fundamentals
In this chapter, basic theoretical concepts regarding quality management, machine learning
and software engineering are presented. These theoretical foundations do not yet provide
solutions to the formulated problem but give a contextual understanding of deploying ML
models for predictive quality.
3.1 Quality Management
First, basic concepts from quality management are presented in order to define predictive
quality and show exemplary applications from practice.
3.1.1 Predictive Quality
In general, data has become more and more important in the field of quality and is used to
make decisions for different types of quality-related tasks. In reference to Industry 4.0, the
term “Quality 4.0” describes the systematic and goal-oriented usage of all available data to
improve quality. Approaches for data-based quality regulation use data mining and
ML methods to optimize the processes in order to achieve a demanded product quality (Ngo
& Schmitt, 2016). Predictive quality is defined as “the empowerment of the user to optimize
product and process-related quality by using data-driven predictions as a basis for decision-
making and action” (R. H. Schmitt et al., 2020).
A terminological distinction between predictive quality and predictive maintenance is
necessary. Predictive maintenance is defined as regular monitoring of the operating
condition of production equipment and aims to ensure the maximum interval between repairs
and minimize the number and cost of unplanned interruptions in production. Process-related
indicators are captured to determine the actual operating condition of critical plant systems
and to schedule maintenance activities according to the obtained data. Successfully
executed predictive maintenance activities improve product quality, productivity, operating
efficiency and profitability of production plants (Mobley, 2002, pp. 4–6).
Predictive quality can enable various potentials in practice ranging from the analysis of past
defects to the prediction of future events and the derivation of remedial measures. This can
be achieved through the use of simple statistical methods or complex ML models. The need
for a data-driven predictive approach is caused by the increasing complexity in production
processes, the rising number of inherent interactions between individual processes and a
significant increase in process variance due to the increasing individualization of products.
As a main objective, adequate measures shall be derived from the prediction to optimize the
quality (R. H. Schmitt et al., 2020).
For the purpose of this thesis, predictive quality is defined as the activity of making
predictions about the quality of a product. In contrast, ensuring aspects such as stable
processes, efficient process chains and the fulfillment of requirements by products fall under
the term of production quality.
3.1.2 Exemplary Use Cases in Practice
The following use cases show examples from real-life production, where either a generic
form of data analysis or specifically ML methods are used to predict the product quality.
Rejects Forecasting in Production Chains
Wasting resources on rejected products gets more expensive along the production chain of
lamps for automotive lighting and LED components. To reduce the reject rate in the last
manufacturing step, a forecasting model for predicting rejects is trained on manufacturing
parameters and inspection data, with the help of which the main influencing variables can be
identified and initial recommendations for action can be derived (R. H. Schmitt et al., 2020).
ML to Predict Product Quality and Geometry in Circular Laser Grooving Process
In the process of circular laser grooving, achieving the desired micro grooves on the
circumference of cylindrical parts depends on the appropriate selection of process
parameters such as workpiece rotational speed or laser power and frequency. A random
forest algorithm is used to derive the most influential input parameters on the outputs with
respect to product quality (Zahrani et al., 2020).
Laser Cutting Process
A laser cutting process is made up of two parallel running sub-processes: guiding the laser
beam and simultaneously moving the work piece. The final cutting shape results, which
define the product quality, do not only depend on the two sub-processes but can also be
influenced by a previous grinding process. As there is no simple, direct chain of effect between
the processes, data mining methods are applied to analyze the complex interdependencies
between the processes (Ngo & Schmitt, 2016).
Quality Improvement of Milling Process
Forecasting vibrations during milling of components for the aerospace industry is achieved
through the implementation of ML algorithms that predict critical process conditions. Very
high requirements for the product quality with regards to surface roughness or dimensional
deviations must be met and can be accomplished by adjusting machining parameters in
advance to avoid critical conditions of the process (Frye & Schmitt, 2019).
Preventive Quality Assurance in Clothing Industry
ML-based systems are not only used in the production process of metal parts to predict
product quality, but also find application in the production of other goods such as textiles. In
cooperation with a German clothes manufacturer, an algorithm was trained for the purpose of
preventive quality assurance which automatically feeds insights about failure rates into the
design process, without the need for any manual data analysis (Nalbach et al., 2018).
In this thesis, predictive quality for a generic production process is considered. On the basis
of measured values, an ML model shall predict if a product will pass the quality control at a
certain stage in production by analyzing the sensor data from the production process itself.
Therefore, each product is to be categorized into classes such as “pass” and “fail” or “ok” and
“not ok”. For a higher level of detail, a model might categorize items in even more than two
classes, e.g., into three groups such as “parts ok”, “rework needed”, “scrap”. As described
before, applications for predictive maintenance and production quality might work similarly to
predictive quality but still need to be distinguished as they all fulfill different tasks and pose
different requirements.
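To make this framing concrete, the following minimal sketch shows how such a pass/fail prediction could be expressed in Python with scikit-learn; the sensor readings, feature names and the choice of a decision tree are purely illustrative assumptions and are not prescribed by this thesis.

```python
# Illustrative sketch only: predictive quality framed as binary classification.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical process data: one row per produced part with its sensor readings
# and the result of the downstream quality control ("ok" / "not ok").
data = pd.DataFrame({
    "spindle_speed_rpm": [1200, 1180, 1250, 1300, 1190, 1260],
    "feed_rate_mm_min":  [300, 310, 295, 280, 305, 290],
    "coolant_temp_c":    [21.5, 22.0, 23.1, 24.8, 21.9, 25.2],
    "quality_result":    ["ok", "ok", "ok", "not ok", "ok", "not ok"],
})

X = data.drop(columns=["quality_result"])  # features: measured process values
y = data["quality_result"]                 # label: outcome of the quality control

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Predict the expected inspection result for a new, unseen part.
new_part = pd.DataFrame([{"spindle_speed_rpm": 1240,
                          "feed_rate_mm_min": 298,
                          "coolant_temp_c": 24.9}])
print(model.predict(new_part))             # -> ["ok"] or ["not ok"]
```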
3.2 Machine Learning (ML)
Machine Learning (ML) is a huge field of research with many authors covering the topic
in depth. For the purpose of deploying ML models rather than building them, it is necessary to
achieve a basic understanding of what ML is and how ML models work.
3.2.1 Definition
First of all, ML must be defined and differentiated from other fields of study. ML frequently
appears in connection with Artificial Intelligence (AI), Deep Learning and Data Science. AI
serves as an umbrella term for ML and Deep Learning, as visualized in Figure 3.1, and is
defined as giving computers the capability of mimicking human behavior, particularly
cognitive functions. Techniques such as robotics, synthetic language and cognitive vision
pertain to AI (Kotu & Deshpande, 2019, pp. 2–3).
Figure 3.1: AI, ML, deep learning and data science based on Kotu and Deshpande (2019, p. 3)
ML as a sub-field or tool of AI covers techniques which give computers the ability to learn
from experience in form of data without being explicitly programmed to do so (Kotu &
Deshpande, 2019, pp. 2–3). The difference between traditional and ML programs is depicted
in Figure 3.2.
Figure 3.2: Traditional program and machine learning (Kotu & Deshpande, 2019, p. 3)
While traditional programming is rule-based, ML aims to learn inherent patterns (Sarkar et
al., 2018, pp. 5–7). The automated detection of meaningful patterns in data is referred to as
learning (Shalev-Shwartz & Ben-David, 2019, p. vii). The definition of learning goes beyond
memorizing past data and describes converting experience into expertise. Gained knowledge
enables broader generalization which means to apply expertise gained from known
examples to unseen data in order to make a prediction (Shalev-Shwartz & Ben-David, 2019,
pp. 19–20). Learning in ML, unlike in psychology, cognitive science, or neuroscience, does not
aim to understand the learning processes in humans and animals but aims to build a useful
system (Alpaydin, 2014, p. 14). As a condensation of the fairly technical definition originally
formulated in 1997 by Mitchell, ML describes learning from experience. This experience is
the input data for the learning algorithm (Shalev-Shwartz & Ben-David, 2019, pp. 19–20).
Algorithms are computational methods using experience to improve performance or to make
accurate predictions (Mohri et al., 2018, p. 1).
As indicated in Figure 3.1, ML and data science show an overlap. ML and data science
methods are used to extract value from data (Harrington, 2012, p. 5; Kotu & Deshpande,
2019, p. 3). What ML has in common with the fields of statistics, operations research and
management information systems is the goal to make data-driven decisions, with the
difference that these fields do not consider reasoning or intuition (Sarkar et al., 2018, pp. 4–
5). Among others, further connected fields are mathematics, data mining and computer
science (Singh Bisen, 2019, p. 13). Deep Learning describes a subset of ML which makes
the computation of multi-layer neural networks feasible (Jeffcock, 2018). Regarding their
operating principle, neural networks deviate strongly from common ML algorithms and, thus,
form a separate category.
Primarily, ML aims to gain insight from data (Harrington, 2012, p. 5). In addition to
understanding available data, its goal is to generate accurate predictions for unseen items by
designing efficient and robust algorithms to produce these predictions even for large-scale
problems (Mohri et al., 2018, p. 3). This is fueled by the need to make data-driven
decisions at scale (Sarkar et al., 2018, p. 4).
Implementing ML is especially beneficial when there is no exact model available and useful
approximations can only be made based on existing and accessible data (Alpaydin, 2014,
pp. 1–2). Furthermore, domain specific problems with a lack of human expertise or problems
at scale with huge volumes of data with too many complex conditions and constraints are
predestined for the utilization of learning methods. ML is also suitable for environments with
continuously changing behavior as well as for conditions where formally explaining or
translating human expertise into computational tasks, e.g. speech recognition, proves to be
difficult (Sarkar et al., 2018, p. 9).
Real-life applications of ML extend across many fields such as retail, finance and
manufacturing (Alpaydin, 2014, p. 3). In retail, possible use cases for ML include
personalized recommendations of products in online shops. Stock market forecasting and
fraud detection are examples from finance. When looking at use cases
from manufacturing, ML can be used to detect failures and defects in production (Sarkar et
al., 2018, p. 65). In the sector of automobile manufacturing, a common use case consists in
predictive maintenance (Singh, 2021, pp. 48–52). AI solutions not only find application in the
manufacturing process itself, but also in the supply chain planning (Rodríguez et al., 2020).
3.2.2 Life Cycle of ML Projects
There are different frameworks available for managing the life cycle of ML projects.
Knowledge discovery in databases, known as the KDD process model, covers the steps
selection, preprocessing, transformation, data mining and interpretation/ evaluation. A
second framework is SEMMA which is an acronym that stands for the steps sample, explore,
modify, model, and assess. Finally, the so-called cross industry standard process for data
mining (CRISP-DM) provides a standardized and generally valid procedure (Azevedo, 2008).
Approaches for KDD and data mining can be applied to structure ML projects as these terms
are used in computer science to describe the same methods which are used for ML
(Alpaydin, 2014, pp. 3, 16).
Out of the introduced process models, CRISP-DM is still the most popular framework for
executing data science projects (Saltz, 2020). It is composed of the steps business
understanding, data understanding, data preparation, modeling, evaluation and deployment
(see Figure 3.3). The arrows in the graphic illustrate that each phase must not be analyzed in
an isolated manner due to the dependencies between phases and the methodology’s cyclical
nature.
Figure 3.3: CRISP-DM (Chapman et al., 2000, p. 10)
Business understanding focuses on converting business requirements into a specific
problem definition. The data understanding includes gaining insights into the available data
which are necessary for the subsequent data preparation phase during which adequate
transformations on the raw data are executed to obtain the final dataset. In the modeling
phase, different techniques and parameters are applied and tested to create the best
possible model. Once the model is built, an evaluation assesses if the objectives are met.
Finally, deployment describes the integration of the ML model into an organization’s
decision-making processes.
3.2.3 Data Preparation and Modeling
The steps of business understanding and data understanding do not require specific ML
knowledge but depend highly on the use case. However, for data preparation and modeling,
a technical understanding of ML is necessary. Both are closely related activities in the
ML life cycle. Table 3.1 summarizes relevant definitions and terminology for the described
steps.
Each instance of data can be described through a set of attributes, the features. In other
words, features form an instance (Harrington, 2012, p. 8). This means that every instance of
data can be viewed as a vector of feature values. If the input data does not come with built-in
features, they need to be constructed by the developer of the ML application. Feature
construction might be necessary for use cases in which instances come without any attributes.
Any kind of adjustment of existing features in order to achieve better results by a learning
algorithm is included in feature transformation (Flach, 2012, pp. 38–46). Two approaches to
transform features are feature selection and feature extraction. Feature selection methods
select or discard features to reduce the overall number of features. Feature extraction
methods engineer new features from the existing ones (Sarkar et al., 2018, p. 40). Not all
available data is used for training an algorithm, but the data set is divided into a training and
test set, typically in an 80 % to 20 % ratio (Géron, 2018, pp. 30–31).
Table 3.1: Definitions and terminology (Mohri et al., 2018, pp. 4–5)
Examples: Instances of data
Features: Set of attributes
Labels: Values or categories assigned to examples
Hyperparameters: Free parameters as inputs to the learning algorithm
Training sample: Examples used to train a learning algorithm
Validation sample: Examples used to tune the parameters of a learning algorithm when working with labeled data
Test sample: Examples used to evaluate the performance of a learning algorithm
Loss function: A function that measures the difference, or loss, between a predicted label and a true label
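As a hedged illustration of these terms, the sketch below uses scikit-learn (an assumed tool choice, not prescribed by this thesis) to select a subset of features and to split a synthetic labeled dataset into training and test samples in the 80/20 ratio mentioned above.

```python
# Sketch: feature selection and an 80/20 train/test split (scikit-learn, synthetic data).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular production dataset: 500 examples, 10 features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

# Feature selection: keep the 4 features that are most associated with the label.
X_selected = SelectKBest(score_func=f_classif, k=4).fit_transform(X, y)

# Hold out 20 % of the examples as a test sample for the later evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)  # (400, 4) (100, 4)
```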
Tasks are the problems that can be solved with ML (Flach, 2012, p. 13). At Microsoft, an ML
task is defined as the type of prediction being made, based on the available data and the
question that is being asked (Quintanilla et al., 2019). Three of the most common and most
important tasks are classification, regression, and clustering, which all serve different
purposes. While classification predicts a nominal target value, that is to say classes,
regression predicts a continuous value (Harrington, 2012, p. 9). Categorizing an item into
one of two classes is called binary classification, in the case of more than two different
classes the task is referred to as multiclass classification. Clustering uses similarity and
distance to group items. Figure 3.4 illustrates how each of the mentioned tasks works. For
classification and regression, past data must be labeled, that is, the classes or
numerical values of previous elements must be known. In contrast, clustering does not need
any additional information in form of a target value or given label from past data (Harrington,
2012, p. 10).
Figure 3.4: Binary and multiclass classification, regression and clustering (Singh, 2021, pp. 8–
10)
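A compact, purely illustrative sketch of the three tasks (again assuming scikit-learn; the values are made up) may help to contrast them:

```python
# Sketch: classification, regression and clustering on tiny illustrative data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])

# Classification: predict a nominal class; labels for past data are required.
clf = DecisionTreeClassifier().fit(X, ["ok", "ok", "ok", "not ok", "not ok", "not ok"])
print(clf.predict([[2.5]]))        # -> ['ok']

# Regression: predict a continuous value; numerical targets are required.
reg = LinearRegression().fit(X, [1.1, 2.0, 3.2, 7.9, 9.1, 10.2])
print(reg.predict([[5.0]]))        # -> a value close to 5

# Clustering: group similar items; no labels are needed.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))  # -> two groups
```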
ML types help to distinguish tasks for a better understanding of the existing variety in ML
(Figure 3.5). The presented selection of tasks, also including classification, regression, and
clustering, does not provide an exhaustive list of all available tasks in academia, but aims to
give an overview of the most relevant ones.
Figure 3.5: Machine learning types according to en.proft.me (2015)
Supervised learning requires the availability of a target variable or label, whereas
unsupervised learning does not (Mohri et al., 2018, pp. 6–7). A supervised learner is
provided with extra information in form of labels by the environment. In case of classification,
an instance is affiliated to a class through a label, for regression a continuous number for
each instance is given. Unsupervised learning is characterized by a learning algorithm which
processes input data without external supervision (Shalev-Shwartz & Ben-David, 2019,
pp. 22–23). In order to cluster similar items of data, no label or target variable is required.
Semi-supervised learning, as suggested by the name, builds on a partially labeled data set
(Mohri et al., 2018, pp. 6–7). A small, labelled training set is used to build an initial model,
which is then refined using the unlabeled data. This ML type can be useful when obtaining labelled
data is associated with high cost (Flach, 2012, pp. 14–20).
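The following sketch illustrates the idea with scikit-learn's label propagation (one possible semi-supervised method, chosen here only for illustration): two labeled examples are used to infer labels for the remaining, unlabeled ones.

```python
# Sketch: semi-supervised learning with a partially labeled data set (scikit-learn).
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.2], [1.1], [7.9], [8.2], [8.0]])
# Only two examples carry a label; -1 marks the unlabeled instances.
y = np.array([0, -1, -1, 1, -1, -1])

model = LabelPropagation().fit(X, y)
print(model.transduction_)          # labels propagated to the unlabeled examples
print(model.predict([[1.5], [7.5]]))
```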
Reinforcement learning may or may not need a target variable but works in a different manner
than the other three ML types. As visualized in Figure 3.6, an agent with a set of strategies or
policies takes an action upon observing the state of the environment, gets a reward or penalty and
updates its policies (Sarkar et al., 2018, pp. 42–43). In order to reach the goal, the agent
generates a policy through the assessment of past sequences of actions (Alpaydin, 2014,
p. 13).
Figure 3.6: Visualization of reinforcement learning (Singh, 2021, pp. 12–13)
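The loop below is a toy sketch of this interaction cycle, not a real reinforcement learning algorithm; the environment, reward and update rule are invented purely to illustrate the state-action-reward terminology.

```python
# Toy sketch of the agent-environment loop (illustrative only, no real RL algorithm).
import random

def environment_step(state, action):
    """Hypothetical environment: returns the next state and a reward."""
    next_state = max(-5, min(5, state + action))
    reward = 1.0 if next_state == 0 else -0.1   # goal: reach state 0
    return next_state, reward

policy = {s: random.choice([-1, 1]) for s in range(-5, 6)}  # state -> action

state = 3
for _ in range(20):
    action = policy[state]                       # the agent acts based on its policy
    next_state, reward = environment_step(state, action)
    if reward < 0:                               # naive policy update from the reward
        policy[state] = -action
    state = next_state
```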
An enormous number of algorithms for different ML tasks can be found in the literature.
Common algorithms for the supervised task of classification are k-nearest neighbors, support
vector machines and decision trees, which all work in a different way (Harrington, 2012,
p. 10). The k-nearest neighbors algorithm classifies data based on distance measurement to
existing data points and, in doing so, looks at the top k most similar pieces of data
(Harrington, 2012, p. 19). Support vector machines aim to separate data with the maximum
margin, in other words, the best separating line is to be found (Harrington, 2012, p. 102).
Decision trees split data sets one feature at a time and have the advantage of being
understandable by humans even without specific ML knowledge (Harrington, 2012, p. 38). In
academia, there are many more algorithms that are either less common than the presented
ones or are applied in other use cases, such as the Naïve Bayes algorithm, which uses probability
theory to classify based on non-numeric, nominal values (Harrington, 2012, p. 62).
Instead of one single model, multiple models in form of an ensemble can be employed.
Combining multiple learners is a strategy to confront the No Free Lunch Theorem which
states that there is no single learning algorithm which in any domain always induces the most
accurate learner (Alpaydin, 2014, p. 487; Flach, 2012, p. 330).
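As an illustrative sketch (scikit-learn on synthetic data, both assumptions not taken from the cited sources), the snippet below trains the three classifiers mentioned above individually and then combines them into a simple majority-vote ensemble:

```python
# Sketch: three common classifiers and a simple voting ensemble on the same data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

single_models = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "support vector machine": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}
for name, model in single_models.items():
    print(name, model.fit(X_train, y_train).score(X_test, y_test))

# Ensemble: combine the three learners by majority vote.
ensemble = VotingClassifier(estimators=[
    ("knn", KNeighborsClassifier()), ("svm", SVC()), ("tree", DecisionTreeClassifier())])
print("ensemble", ensemble.fit(X_train, y_train).score(X_test, y_test))
```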
Regarding the learning protocol, there is online and batch learning. Online learning means
that the learner has to respond online, throughout the learning process, so there is no
separation between the training phase and prediction phase (Shalev-Shwartz & Ben-David,
2019, p. 24). Online learning is also referred to as incremental learning as the model
continuously learns with new data (Flach, 2012, p. 361). In batch learning scenarios, the
model is trained on large amounts of training data at once before making predictions (Sarkar
et al., 2018, pp. 43–44).
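A brief sketch of the difference, again assuming scikit-learn and a simulated data stream:

```python
# Sketch: batch learning vs. online (incremental) learning on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Batch learning: the model is trained once on all available data.
batch_model = LogisticRegression(max_iter=1000).fit(X, y)

# Online / incremental learning: the model is updated chunk by chunk as data arrives.
online_model = SGDClassifier(random_state=0)
for start in range(0, len(X), 100):               # simulate a stream of data chunks
    X_chunk, y_chunk = X[start:start + 100], y[start:start + 100]
    online_model.partial_fit(X_chunk, y_chunk, classes=[0, 1])

print(batch_model.score(X, y), online_model.score(X, y))
```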
3.2.4 Evaluation and Deployment
The performance is usually a quantitative measure or metric which is used to see how well
the algorithm or model is performing the task with experience (Sarkar et al., 2018, p. 12).
Binary classification has already been identified as the adequate task for predictive quality.
Thus, the focus lies on performance measures for this specific application area while metrics
for further tasks such as regression are not covered.
Table 3.2 depicts the so-called confusion matrix which juxtaposes the actual classes and the
ones predicted by the ML model. Two classes are differentiated: positive and negative. True
positives, short TP, are all elements that are actually positive and are identified correctly as
positive by the model. Following the same logic, true negatives (TN) belong to the actual
negative class and are rightly identified as negative. False negatives (FN) and false positives
(FP) are wrongly classified as negative or positive, respectively.
Table 3.2: Confusion matrix (Harrington, 2012, p. 144)
                      Actual positive        Actual negative
Predicted positive    True Positive (TP)     False Positive (FP)
Predicted negative    False Negative (FN)    True Negative (TN)
The performance of a classifier can typically be measured through a set of different metrics
shown in Table 3.3, including accuracy and error rate, with the sum of accuracy and error rate
equaling exactly 1. In addition to accuracy and error rate, typical performance measures for
classification include precision, recall (also called sensitivity), specificity, and the F1-score. All
of them range between 0 and 1, with 1 being the perfect score. Using a single indicator as the
objective function for the optimization of an algorithm’s parameters may lead to undesired
results. Therefore, the F1-score combines precision and recall into one metric.
Table 3.3: Classification metrics, each given as a verbal and a mathematical formula (Flach, 2012, pp. 53–61; Harrington, 2012, p. 24; Sarkar et al., 2018, p. 12)
Accuracy = number of correct predictions / total number of predictions = (TP + TN) / (TP + TN + FP + FN)
Error rate = number of wrong predictions / total number of predictions = (FP + FN) / (TP + TN + FP + FN)
Precision = true positives / predicted positive results = TP / (TP + FP)
Recall (Sensitivity) = true positives / actual positive results = TP / (TP + FN)
Specificity = true negatives / actual negative results = TN / (TN + FP)
F1-Score = 2 * precision * recall / (precision + recall) = TP / (TP + (FP + FN) / 2)
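The metrics in Table 3.3 can be reproduced with a few lines of Python; the sketch below uses scikit-learn's metric functions (an assumed tool choice) on a small, made-up set of predictions.

```python
# Sketch: computing the classification metrics from Table 3.3 for a toy prediction.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = positive ("not ok"), 0 = negative ("ok")
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)

print("accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / all predictions
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1-score :", f1_score(y_true, y_pred))          # combines precision and recall
# Specificity is not a built-in scorer but follows directly from the confusion matrix:
print("specificity:", tn / (tn + fp))                  # TN / (TN + FP)
```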
Instead of the standard metrics, it is also possible to use a modified personalized cost
function to adjust the weights to, e.g., penalize wrong classification (Sarkar et al., 2018,
p. 12). Nonetheless, the success of an ML algorithm is subject to the available data (Mohri
et al., 2018, p. 1). The fact that a model’s performance is always only as good as the data is
expressed by Sarkar et al. (2018, p. 44) through the following phrase: “Garbage in, garbage
out.”
The performance is not only limited by the data quality but also by bias, which describes prior
knowledge or prior assumptions built into the model that influence its performance on the task
(Shalev-Shwartz & Ben-David, 2019, p. 60). Performance measures make it possible to identify
deviations from the optimal model complexity (Singh, 2021, p. 15). The trade-off between the
sample size and complexity plays a critical role in generalization. When the sample size is
relatively small, choosing an overly complex algorithm may lead to poor generalization, which
is known as overfitting. On the other hand, an overly simple algorithm may not be able to
achieve a sufficient score, which is known as underfitting (Mohri et al., 2018, p. 8).
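The effect can be illustrated with a small sketch: an overly shallow decision tree tends to underfit, while a fully grown tree on a small sample tends to overfit, which becomes visible in the gap between training and validation accuracy. The dataset and parameter choices below are purely illustrative.

```python
# Sketch of under- and overfitting (illustrative only): a shallow decision
# tree underfits, while a fully grown tree overfits the small training sample.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for max_depth in (1, 4, None):               # None lets the tree grow fully
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    clf.fit(X_train, y_train)
    print(max_depth,
          round(clf.score(X_train, y_train), 3),   # training accuracy
          round(clf.score(X_val, y_val), 3))       # validation accuracy
```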
If the evaluation phase of the life cycle is passed successfully, the deployment is realized.
Generally, deployment is defined as “the action of bringing resources into effective action”
(Oxford University Press, 2020). When talking about deployment in the context of ML,
different definitions exist. Singh (2021, p. 57) defines the task of deployment
as integrating the ML model into an existing business application, which coincides with the
deployment’s definition in CRISP-DM. In an alternative definition by Galli (2020), the
deployment of ML models refers to making the models available in a production environment
in order to provide predictions to other software systems and clients. In this thesis, the goal
of deployment is defined as follows: making a resulting model available in a specific
environment in order to make the results usable where they are needed.
When it comes to deployment, not only the ML models but the whole ML pipeline is to be
deployed. An ML pipeline encompasses all the steps required to get a prediction from data
with the ML algorithm only being one component of said pipeline (Figure 3.7).
Figure 3.7: Typical Steps in an ML Pipeline based on Galli (2020)
The goal of building an ML model is to solve a problem, and an ML model can only do so
when it is in production and actively in use by consumers (Singh, 2021, p. 58). To maximize
the value of any ML model, the capability to reliably extract predictions and share them with
other systems must be enabled. As such, model deployment is as important as model building.
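A minimal sketch of such a pipeline, bundling feature engineering, feature selection and the ML algorithm into one deployable object, could look as follows; scikit-learn is assumed, and the dataset as well as the chosen components are purely illustrative.

```python
# Minimal ML pipeline sketch mirroring Figure 3.7: feature engineering,
# feature selection and the ML algorithm bundled into one deployable object.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

pipeline = Pipeline(steps=[
    ("feature_engineering", StandardScaler()),          # simple stand-in for feature engineering
    ("feature_selection", SelectKBest(f_classif, k=10)),
    ("ml_algorithm", RandomForestClassifier(random_state=0)),
])
pipeline.fit(X, y)                  # train the whole pipeline, not just the model
predictions = pipeline.predict(X)   # data in, predictions out
```

Deploying the fitted pipeline object rather than the bare model ensures that the same preprocessing steps are applied at prediction time.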
3.3 Software Engineering
After creating an ML model for a use case in quality management, the deployment of the
created models requires an understanding of principles from the field of software
engineering. According to IEEE Computer Society, software engineering is defined as the
“application of a systematic, disciplined, quantifiable approach to the development, operation,
and maintenance of software”. First, the focus is set on traditional software and then on the
characteristics of software in combination with ML.
3.3.1 Traditional Software
The development process of traditional software is described first, followed by the relevant
aspects of collaboration and automation during all three phases: development, operation and
maintenance.
Development
Initial approaches to software development date back to 1956, when the nine-phase stage-
wise model was introduced by Benington. Based on this, Royce presented his proposal for
managing the development of large software systems in 1970 (see Figure 3.8). Although the
term itself was not used by the author, the approach gained popularity under the name of
waterfall model and is considered the traditional approach to software development.
Figure 3.8: Implementation steps to develop a large computer program (Royce, 1970)
As evolution continued, improved approaches such as the Spiral Model by Boehm (1988) or
the Vee-Model (Forsberg & Mooz, 1998) were introduced. A shared weakness of all previously
mentioned approaches is that the need to continuously repeat stages was not anticipated
proactively but handled reactively. For this reason, methodologies with an iterative character
in the form of the Software Development Life Cycle (SDLC), shown in Figure 3.9, emerged
(Everett & McLeod, 2007, p. 57).
Figure 3.9: Software development life cycle according to bigwater.consulting
Collaboration and Automation
Originally introduced as an agile framework for software development, the scrum technique
uses incremental, iterative work sequences to manage and enhance the speed of
development projects. When working in a team of developers, work is divided into actions
that are completed within sprints. Daily scrums serve to track progress and re-plan (Figure
3.10). The framework not only finds application in fields related to software but has also
manifested itself in other areas such as product development. The topic has lost none of its
relevance, so guides for the scrum methodology are continuously developed and improved
(Schwaber & Sutherland, 2020).
[Figure 3.8 steps: System Requirements, Software Requirements, Analysis, Program Design, Coding, Testing, Operations. Figure 3.9 phases: 1 Planning, 2 Analysis, 3 Design, 4 Implementation, 5 Testing & Integration, 6 Maintenance]
Figure 3.10: Scrum framework according to Scrum.org
In 2009, during a conference about deploying software, the term DevOps was coined as a
combination of the words software development and IT operations. The main idea is to focus
on the collaboration between developers and operators while not neglecting the alignment
with the business side. DevOps provides a collection of field-tested and working approaches
to address this problem (Halstenberg et al., 2020). Figure 3.11 shows a graphical visualization
of the DevOps process which has been widely adopted in the community. The DevOps
approach features building blocks which correspond to the phases in previous software
development models, with the main difference to traditional approaches being the
collaboration between development and operations.
Figure 3.11: DevOps approach according to Harlann (2017)
Practices such as continuous integration (CI), continuous delivery (CD) and continuous
deployment go even one step further by automating steps. They aim to reduce the required
effort and avoid possible errors which can occur during a manual execution resulting in a
more efficient and more secure development process. Figure 3.12 reveals the difference
between the mentioned practices.
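As an illustration of the automation idea, continuous integration typically runs an automated test suite on every code change. A minimal sketch of such a test, written here with pytest, might look as follows; the predict_quality function is a hypothetical stand-in for application logic and not part of any cited approach.

```python
# Minimal sketch of an automated test that a CI pipeline could run on every
# commit. predict_quality is a hypothetical stand-in for application logic.
import pytest

def predict_quality(measurements: list) -> str:
    """Hypothetical stand-in for the deployed prediction logic."""
    return "ok" if sum(measurements) < 10.0 else "failure"

def test_prediction_returns_valid_label():
    assert predict_quality([1.0, 2.0, 3.0]) in {"ok", "failure"}

def test_prediction_rejects_missing_input():
    with pytest.raises(TypeError):
        predict_quality(None)
```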
[Figure 3.10 elements: Product Backlog, Sprint Planning, Sprint Backlog, Scrum Team, Daily Scrum, Sprint Review, Sprint Retrospective, Product Increment. Figure 3.11 loop: code, build, test, deploy, operate, monitor]
Figure 3.12: Continuous integration, delivery and deployment according to Pennington (2019)
3.3.2 ML Software
Deploying an ML model can be understood as developing a program around the model
which is to be made accessible to the end user. As an important distinction, deployment
as part of the ML workflow must not be mistaken for the deploy step in software development
approaches such as DevOps: the deployment of ML models includes the whole software
development process. Nonetheless, there are three fundamental differences between
developing an ML application and conventional software. First, the handling of data is more
complex. Second, skills in both software engineering and ML are required. And third, software
components are interwoven with no clear boundaries. It is therefore all the more important
to understand the challenges of ML software and address problems in a timely manner
(Amershi et al., 2019).
Since ML places special demands on software development, but fundamental similarities
exist, approaches from software engineering can be tailored to ML in a modified form. In this
way, the discipline of MLOps arose. Just as in DevOps, robust automation, trust and
collaboration between teams, as well as delivering high quality along the whole end-to-end
service life cycle, play key roles. However, MLOps represents a new and unique discipline, as
deploying traditional software is not as complex as deploying an ML application into
production (Treveil & Dataiku Team, 2020, pp. 6–7). Compared to the complexity of the ML
environment, with its dynamically changing data, traditional software is relatively static.
Graphically, the DevOps circle can be extended to include ML as an upstream task, as
illustrated in Figure 3.13.
Figure 3.13: MLOps (Neal Analytics, 2020)
[Figure 3.12/3.13 content: the DevOps loop (code, build, test, deploy, operate, monitor) with continuous integration, continuous delivery and continuous deployment; the MLOps loop in Figure 3.13 adds the steps plan, create, data, model, verify, package, release, configure, monitor]
4 State of the Art
In this chapter, the state of the art is analyzed by examining existing approaches
to ML deployment for predictive quality in production. After the definition of evaluation
criteria, a literature review is executed including the evaluation of selected publications. Out
of the selected approaches, the most relevant ones are then presented in more detail.
4.1 Definition of Evaluation Criteria
Selecting suitable criteria for the evaluation of available concepts represents an important
step in this work. The criteria must be defined in a coherent way so that, if all criteria are met,
the overall objective is achieved. According to Keeney and Gregory (2005) appropriate
attributes are unambiguous, comprehensive, direct, operational, and understandable. In
other words, good evaluation criteria need to be accurate, not redundant, ends-oriented,
practical, and easy to understand. Furthermore, independence between attributes is to be
strived for (Keeney, 1992).
In order to structure the defined criteria, they are assigned to the object domain, the solution
hypothesis, or the target domain. With criteria belonging to the object domain, the scope of
the analysis is evaluated. The purpose and goal of an approach is captured by criteria
associated to the target domain. With the help of criteria regarding the solution hypothesis,
the specific solution path chosen by the respective authors to achieve the goal is assessed.
Table 4.1 gives an overview of the defined evaluation criteria, which can be applied to
sources across different fields of investigation.
Table 4.1: Evaluation criteria
Category Evaluation Criteria
1. Object Domain 1.1 Deployment of ML
1.2 ML for Predictive Quality
2. Solution Hypothesis 2.1 Strategic Planning
2.2 Operational Realization
3. Target Domain 3.1 Guideline Structure
3.2 Transferability
Deployment of ML
As the first of two criteria of the object domain, the level of detail with which approaches deal
with the deployment of ML models or any AI application is evaluated. Based on
evidence from studies, chapter 2 showed that many ML projects fail during the deployment
phase and all activities associated with deploying ML models need to be analyzed critically.
Thus, it is necessary to evaluate existing approaches with respect to their touch points with
deployment. These might range from only mentioning the deployment up to focusing on
deploying ML models and ML software.
ML for Predictive Quality
A lack of coordination between roles is one of the common reasons for the unsuccessful
realization of projects (see chapter 2). In other words, the inclusion of domain knowledge is
crucial for the success of ML in general but also for the deployment. In chapter 3.1.2,
exemplary use cases of ML for quality prediction in production were described highlighting
the relevance and potential of the application of ML in this context. Predictive quality use
cases come with their individual requirements which can deviate greatly from applications in
other fields and environments. Therefore, existing approaches are examined to see to what
extent they cover the application of ML models for predictive quality. Available sources may
not treat the use of ML in manufacturing environments at all. Others may describe its use for
generic purposes in production with predictive quality being the most specific application.
Strategic Planning
This and the following criterion relate to the question of how the authors address the topic.
As seen in the detailed description of the problem in practice, the success of ML projects
depends both on strategic planning and operational realization.
High-level decisions comprise planning activities that shape the future direction by
determining the desired characteristics of the system; the selection is particularly important
due to the long-term effects of the decision. This criterion analyzes the level of
detail with which existing approaches cover the deployment from a strategic point of view.
Approaches which fulfill this criterion to an advanced degree present detailed concepts and
frameworks in form of theoretical analyses or generic procedures.
Operational Realization
In addition to the strategic perspective, the challenges for deployment also need to be
addressed on a more practical level. Among many others, relevant factors for the quality and
efficiency of implementation activities are the use of best practices and selection of tools. By
means of this criterion, approaches are assessed regarding their operational depth. This
operational point of view focuses on practical questions about the best way of successfully
realizing the implementation. Approaches may present a use case from a real-life application
including the provision of tools, implementation steps and results.
Guideline Structure
As an overall objective, this thesis aims to answer the research question formulated in
chapter 1.2 which focuses on how to deploy ML models for predictive quality in production.
Based on the current status of ML projects in practice, companies need instructions in form
of a guideline which they can transfer to their individual needs for successfully realizing the
deployment.
With the help of this penultimate criterion, the format of the respective approaches is
assessed to determine to what extent the respective approach serves as a guideline.
Approaches that aim to be used as a guideline may provide step-by-step instructions or a
clear structure which allows the reader to gain insights and knowledge on how to address the
topic.
Transferability
Not only is a guideline format required; the approach must also be transferable to different
use cases which fall into the category of ML in production but are based on a different set of
requirements. Thus, this criterion evaluates how well the respective
approaches are transferable to further use cases with company-specific needs and
restrictions in order to find the best fitting solution for any specific situation. It is analyzed how
easily existing approaches can be implemented in different environments or based on a
different set of requirements.
4.2 Literature Review
Based on the procedure for literature analysis defined by Cumbie et al. (2005) the following
methodological steps are applied in this work:
1. Accumulation and selection of publications
2. Categorization of publications
3. Evaluation of publications
At first, a pool of publications is accumulated out of which the most relevant ones are
selected. As a second step, selected publications are classified by their type and method. In
the third and final phase, an analysis of the results in form of an evaluation is conducted.
4.2.1 Step 1: Accumulation and Selection of Publications
Accumulation of Publications
As the deployment of ML models for predictive quality in production is located at the
intersection of two fields of investigation, production engineering on the one hand and
software engineering on the other, the literature review needs to reflect this. Due to the quite
different character of
both fields regarding the availability of literature, an individual search process for each field is
required. Both searches are not sharply separable, so that results can appear multiple times
within one search or across the two searches.
Table 4.2: Search strings
Search I (focus on industrial production engineering): machine learning AND deploy* AND (predictive quality OR production quality OR product quality OR manufacturing quality OR quality prediction)
Search II (focus on software engineering): (machine learning OR artificial intelligence) AND deploy* AND (producti* OR model serv* OR software engineering)
Note: The asterisk (*) indicates a set of key words beginning with the respective prefix.
For publications treating the application of ML models in manufacturing settings, the principal
source is the database ScienceDirect containing journal articles. The search strings of Search I
in Table 4.2 are used to search within the title, abstract and key words. Combining
all searches, 133 search results were found in ScienceDirect with the same articles
appearing multiple times across different search terms. Furthermore, the same search
strings were used for a search in SpringerLink. The search was limited to sources with
“machine learning” in the title and having to contain the term “deployment” and each of the
listed quality-related key words. The search resulted in 52 items. Given the search
parameters, all the aforementioned publications focus on the application of ML models for
quality-related tasks in manufacturing. From a production engineering perspective, they
provide relevant use cases and help to identify the needs that are relevant for the
deployment but do not cover the deployment from a software engineering point of view.
Illuminating the deployment of ML models as a software engineering topic requires a different
set of search strings, which can be found under Search II in Table 4.2. Initially, the search
is conducted in a similar manner to the previously described procedure by performing an
advanced search in ScienceDirect and SpringerLink. In ScienceDirect, the search resulted in
77 findings. In SpringerLink, the search was adjusted in such a way that the key words
in the last group were not searched independently but in combination (AND instead of OR).
This led to the identification of 119 items. As an additional source, the ACM digital library was
consulted offering a comprehensive collection of full-text articles covering the fields of
computing and information technology. It allows finding conference contributions that were
published as proceedings. By searching for machine learning or artificial intelligence in the
title, deploy in the abstract and the variation of the last group of the key words in the full text,
88 search results were found.
In order to gather all relevant sources, searches in dynamically changing areas of
investigation cannot be limited to academic publications but need to include so-called gray
literature. In a field like software engineering, the academic literature only gives an
incomplete view on the topic. Through gray literature publications, practitioners can provide
contextual information from their experience and, thus, verify scientific outcomes from a
practical point of view (Garousi et al., 2019). As activities for deploying ML models are
closely related to software engineering, the search in this thesis considers gray literature. ML
deployment is characterized by being a dynamic subject with a lot of approaches which do
not follow scientific rules. In many cases, best practices are created by professionals to
streamline the processes in the real world and only then transferred into academic
environments. Thus, relevant sources also include conference contributions which are not
published in a proceedings book, white papers from associations and companies, and
internet articles such as blog entries by highly acknowledged practitioners to share
experiences from practice. To identify the application-oriented publications, Google Scholar
and the regular Google Search Engine were used. The searches, based on the same search
strings as described before, resulted in a very high number of results, out of which the first
pages of results were considered.
Selection of Publications
Based on the search results obtained, the most relevant items for this thesis are selected.
Depending on the type of publication, selection criteria differ. For all resulting journal articles,
either from production or computer science background, the abstract is read and matched
with the scope of this work. Where applicable, the selection considered manufacturing-
related publications while discarding search results from other fields such as medicine.
Whether a book is selected depends on the title, table of contents and introduction. Conference
papers typically do not undergo an equally rigorous review process as journal articles and therefore
need to be examined with more detail regarding their quality and type of conference. Gray
literature publications such as white papers and internet articles offer the least level of
credibility and require an even more profound examination of worthiness. This examination
reviews the background and expertise of each author. Additionally, the credibility and
independence of the publishing website is assessed. Moderated online publishing platforms
are more likely to offer objectivity than company-sponsored pages that might aim to advertise
a certain product. The quality of content in form of the provided level of detail and extent of
the text also impacts the selection decision. Furthermore, gray literature does not only
demand a quality check as a prerequisite for the use of this kind of source but also requires a
separate archiving process as its availability, in contrast to, e.g., journal articles, is not
guaranteed in the future.
Based on this initial set of publications, further relevant sources are to be investigated.
Snowballing is a technique that can be applied to identify related publications (Wohlin, 2014).
Backward snowballing refers to identifying new relevant articles in the bibliography of a
paper, whereas forward snowballing means to examine papers citing the respective paper in
the starting set. This forward and backward search can also be applied to find similar
publications by the same author.
Additionally, it is possible to find connected papers and visualize the connections through
graphs as illustrated exemplarily through a screenshot in Figure 4.1 taken from
connectedpapers.com. The graph is not a citation tree but arranges papers according to their
similarity. Similar papers have strong connecting lines and cluster together. Each paper’s
number of citations is represented by the node size, the node color indicates the publishing
year. For publications that are identified through techniques such as snowballing or
connection graphs, the same selection criteria as for the initial set apply.
Figure 4.1: Connected papers to Sculley et al. (2015) from connectedpapers.com
For the first, production engineering-related search, a total of 19 publications were selected.
As stated before, the second search focusing on the field of software engineering resulted in
27 relevant approaches that were selected for further analysis. Through the inclusion of gray
literature, the number of selected publications is slightly higher than for the first executed
search. In total, 46 publications are identified as relevant for this thesis and will be analyzed
in the following in more detail.
4.2.2 Step 2: Categorization of Publications
To classify the selected items, metadata such as publication year and type of origin are
examined and visually illustrated. Figure 4.2 arranges the final search results by the
respective year in which they were published. ML in manufacturing and deploying ML models
has gained popularity over the last years so that the most relevant sources for this work
originate from 2020. The low number of results from 2021 is due to the time of conducting
this research in spring 2021. Overall, the graphic highlights the high topicality and emerging
relevance of the subject and serves as a confirmation of the need for research. Over the
course of the upcoming years, the number of publications about the topic is expected to
increase even more.
Figure 4.2: Year distribution of selected publications (Total of 46)
In order to gain more insight into the identified results, all selected publications are classified
by the type of publication in Figure 4.3. Roughly half of the final results belong to the
categories of books or book chapters and journal articles. Conference papers
make up nearly a quarter of the selected publications. Gray literature in the form of white papers
and internet articles accumulates to a little more than a quarter. It can be seen that academic
literature is not sufficient, but a mix of different publication types is necessary to fully grasp
the topic.
Figure 4.3: Type distribution of selected publications (Total of 46)
[Figure 4.2 data: 2015: 1, 2016: 1, 2017: 3, 2018: 5, 2019: 8, 2020: 21, 2021: 7 publications. Figure 4.3 data: Book/Chapter: 5 (10.9 %), Journal Article: 19 (41.3 %), Conference Paper: 10 (21.7 %), White Paper: 3 (6.5 %), Internet Article: 9 (19.6 %)]
4.2.3 Step 3: Evaluation of Publications
As a third and final step, the content of each approach within the selected results is
evaluated. For a consistent evaluation of the literature, the existing approaches are assessed
with the help of the criteria defined previously in chapter 4.1. As there are no binary criteria
like yes-no questions, it is analyzed to what extent each criterion is fulfilled by the existing
approaches. The degree of fulfillment of each criterion ranges from not at all fulfilled, to
sparsely, partly, mainly, and completely fulfilled. Table 4.3 has the respective authors listed
in the rows and the evaluation criteria in the columns with the fulfillment degree for the
approaches being visualized by Harvey balls. The approaches are grouped according to their
focus, either production or software engineering. They are arranged regarding the publishing
year and sorted by the author’s name within each year.
Table 4.3: Evaluation of existing approaches
Explanation:
● Completely fulfilled
◕ Mainly fulfilled
◑ Partly fulfilled
◔ Sparsely fulfilled
○ Not at all fulfilled
Criteria: 1.1 Deployment of ML | 1.2 ML for Predictive Quality | 2.1 Strategic Planning | 2.2 Operational Realization | 3.1 Guideline Structure | 3.2 Transferability
Search I
Brüning et al. (2017) ◔ ● ◔ ◕ ○ ◔
Vafeiadis et al. (2017) ◔ ● ◕ ○ ◔ ◑
Mehta et al. (2018) ◑ ◕ ◕ ◑ ○ ◑
Nalbach et al. (2018) ◔ ● ◑ ◕ ○ ◑
Ariharan et al. (2019) ◕ ◕ ◑ ◔ ○ ◔
Escobar et al. (2020) ◔ ● ◕ ◕ ○ ◔
Kimera and Nangolo (2020) ◑ ◑ ◑ ● ○ ○
Krauß et al. (2020) ◔ ● ◑ ◑ ○ ◔
Lehmann et al. (2020) ◑ ◕ ● ● ○ ◔
Rychener et al. (2020) ◕ ◑ ● ◔ ◔ ◕
Schorr et al. (2020) ○ ● ◑ ◑ ○ ◔
J. Schmitt et al. (2020) ◑ ● ◕ ◑ ○ ○
Svetashova et al. (2020) ○ ● ◔ ● ○ ○
Yong and Brintrup (2020) ◑ ◑ ◑ ◑ ○ ◑
Goldman et al. (2021) ◔ ● ○ ◕ ○ ○
Lichtenwalter et al. (2021) ◑ ◕ ◔ ◑ ◔ ◕
Pilarski et al. (2021) ◕ ◔ ◑ ◕ ○ ◔
Turetskyy et al. (2021) ◑ ◕ ◕ ◑ ○ ◔
Zeiser et al. (2021) ◔ ● ◔ ◔ ○ ○
Criteria: 1.1 | 1.2 | 2.1 | 2.2 | 3.1 | 3.2
Search II
Sculley et al. (2015) ◑ ○ ◕ ◔ ◑ ◕
Zinkevich (2016) ◕ ○ ◕ ◑ ◕ ●
Breck et al. (2017) ◑ ○ ◕ ◔ ◑ ◕
Ackermann et al. (2018) ● ○ ◔ ◑ ◔ ◔
Crankshaw and Gonzalez (2018) ◕ ○ ◑ ○ ◔ ◔
Muthusamy et al. (2018) ● ○ ◑ ◔ ◔ ○
Amershi et al. (2019) ◕ ○ ◕ ○ ◔ ◑
Gisselaire et al. (2019) ● ○ ◕ ◔ ◔ ◕
Kervizic (2019) ● ○ ◑ ◔ ◕ ◑
Lwakatare et al. (2019) ● ○ ◕ ◑ ○ ◔
Samiullah (2019, 2020) ● ○ ◕ ○ ● ◕
Sato et al. (2019) ● ○ ◕ ◕ ● ◑
Washizaki et al. (2019) ◕ ○ ◑ ○ ○ ○
Agrawal and Mittal (2020) ◑ ○ ◔ ○ ◕ ○
Akyildiz (2020a, 2020b) ● ○ ◑ ○ ◑ ●
Bhatt et al. (2020) ● ○ ◕ ○ ◔ ○
Debauche et al. (2020) ● ○ ◑ ◑ ◔ ◑
Figalist et al. (2020) ◕ ○ ● ◑ ◑ ◑
Liu et al. (2020) ● ○ ◑ ◔ ◔ ◑
Odegua (2020) ● ○ ◔ ◔ ◕ ◕
Pääkkönen and Pakkala (2020) ● ○ ● ◔ ○ ◑
Patruno (2020) ● ○ ◑ ● ● ◑
Pinhasi (2020) ● ○ ◑ ◔ ◑ ●
John et al. (2021) ● ○ ● ○ ◔ ◑
Singh (2021) ● ○ ◑ ● ◕ ◕
Results of search I contain scientific publications from an industrial or production engineering
background. When looking at the achieved fulfillment of the first two criteria, which belong to
the object domain, the origin of the articles becomes apparent. The approaches cover
possible applications of ML in manufacturing, not always the use case of predictive quality
but similar use cases and quality-related issues. Thus, the respective criterion is fulfilled in
many cases. As the approaches focus on the development of these ML models, the
deployment itself is not covered sufficiently. Some approaches do not provide any useful
information about the deployment, while others give some insights into the deployment process.
A common solution path for authors of items in this category is to provide either a strategic or
an operational perspective on the subject. Hence, not many approaches perform well on both
criteria of the solution hypothesis. Finally, the mean performance on the two criteria of the
target domain differs greatly. On the one hand, almost no approach serves as a guideline;
in many cases, a specific application is presented. Even though the authors do not provide
instructions, the use cases are often general enough to be transferable to other use cases.
In search II, approaches about the deployment of ML models from a software-related
background are grouped together; their focus does not lie on possible quality-related
applications in manufacturing but on the deployment process. Therefore, the performance of
the approaches in the first two criteria is evident: most approaches focus on, or at least treat,
the deployment as a key factor, but the connection to a manufacturing environment is not
made. Similar to the previous category of publications, many approaches do not cover
strategic planning and operational realization at the same time. The items in the table that
were identified through the gray literature review typically fulfill the guideline criterion best.
As many of them are aimed at being used by other practitioners, easy transferability to other
use cases is given in many cases.
Overall, both categories introduced in the search are characterized by their own strengths and
weaknesses. Approaches from production engineering describe the requirements and
specialties of applications in industry, but do not provide instructions, which makes it
difficult to adapt the presented concepts to individual new use cases. Publications with a
software engineering focus more commonly provide instructions but do not consider the
specific circumstances when deploying in a production process. Regarding strategic and
operational depth, in both categories one only rarely finds an approach that covers both
levels. In conclusion, there is no approach fulfilling all criteria to a satisfactory degree.
4.3 Most Relevant Approaches
In the following, a selection of the most important approaches from both categories is
presented. These relevant approaches introduce concepts which lay the basis for the
development of the author's own methodology in the further course of this thesis.
Automated ML for Predictive Quality by Krauß et al. (2020)
ML-based quality prediction allows the reduction of production lead time and repair costs, but
heavily depends on specialized human resources. This is where AutoML comes into play, a
technique to automate all repetitive and uncreative ML tasks in order to increase the time
spent on creative tasks. The use case consists of a process chain of six different processes,
through which each product runs sequentially, passing a quality assurance gate at the end of
each process (Figure 4.4). Whether a product is in-spec or off-spec is predicted by an ML
model, which classifies each product after completing process 5 into “ok”, “failure A” or “failure B”.
Figure 4.4: Illustration of the process chain (Krauß et al., 2020)
The authors focus on setting up and testing automated techniques for ML and only hint at
possible deployment options without further explaining or classifying them. Deploying the
model as a web server in combination with an API is identified as the most common
approach. Furthermore, on-edge deployment is described briefly.
Predictive model-based Quality Inspection by J. Schmitt et al. (2020)
In the publication, a prediction model based on supervised ML algorithms, which allows
predicting the final product quality on the basis of recorded process parameters, is developed
and deployed into the IoT architecture of a manufacturing plant. This integrated solution of
predictive model-based quality inspection in industrial manufacturing is based on the fields of
ML techniques and edge cloud computing technology; edge cloud computing combines
cloud computing and computing on an edge device. In the framework (Figure 4.5), the
deployment, understood as the organizational integration into the inspection planning process,
is distinguished from the technical implementation, which gives an orientation for the individual
configuration of the system according to requirements and resource constraints. Lastly, the
technical integration into the existing infrastructure is too individualized and therefore not
covered by the authors. The process is illustrated by a real-world use case, also including a
brief description of the tools that were used.
Figure 4.5: Predictive model-based quality inspection framework (J. Schmitt et al., 2020)
Hidden technical Debt by Sculley et al. (2015)
In their widely recognized paper about hidden technical debt in machine learning, Google
researchers Sculley et al. explore several ML-specific risk factors to account for in system
design in order to avoid massive ongoing maintenance costs in real-world ML systems.
Figure 4.6 illustrates the complexity of ML graphically by recognizing that a mature system
might end up being 5 % machine learning code and 95 % glue code, which does not add any
functionality but only serves to make different parts of code compatible. Due to the enormous
complexity, small changes may cause incalculable effects. The authors refer to this as the
CACE principle: Changing Anything Changes Everything. Thus, a tiny accuracy benefit at the
cost of massive increases in system complexity is not recommended.
Figure 4.6: ML Code as small fraction of ML systems (Sculley et al., 2015)
ML Test Score by Breck et al. (2017)
In order to reduce the technical debt, Breck et al. introduce testing and monitoring for
ensuring the production-readiness of an ML system. But again, testing of ML systems proves
to be more challenging than in traditional software systems due to the strong dependency on
data and models. Figure 4.7 indicates the necessary increase in testing and monitoring complexity. The approach
provides a checklist which can be run against any ML system.
[Figure 4.5 components: physical process, data collection and processing, data storage, model training and scoring, model deployment, distinguished into technical implementation and technical integration. Figure 4.6 components surrounding the small ML code box: configuration, data collection, feature extraction, data verification, machine resource management, serving infrastructure, monitoring, process management tools, analysis tools]
Figure 4.7: Traditional system and ML-based system testing and monitoring (Breck et al., 2017)
Rules for ML by Zinkevich (2016)
The author presents best practices in ML from his experience at Google in a similar form to
guides to practical programming. In total, 43 rules arranged in four parts are presented. In
the first part, called “before ML”, there are three rules that aim to help determine whether
the time is right for building an ML system. The second part, ML Phase I, is about deploying
an initial pipeline and monitoring it on the basis of adequate objectives. This section contains
12 rules. ML Phase II, as the third of the four parts, comprises 22 rules about launching and
iterating while adding new features to the pipeline. These rules also treat the evaluation of
models and the training-serving skew. Finally, in the last part called ML Phase III, six rules
about slowed growth, optimization refinement, and complex models are discussed.
CD4ML by Sato et al. (2019)
When researching ML deployment from an application-near point of view, the publication by
Sato et al. in collaboration with Martin Fowler is one of the most cited references. Continuous
Delivery for Machine Learning (CD4ML) is a software engineering approach to develop,
deploy, and continuously improve ML applications. The end-to-end process (Figure 4.8)
consists of steps that can be automated, with three axes that are subject to change and must
be taken into account: code, model and data. With the help of an example, the steps
beginning with the model building and ending with the monitoring and observability are
illustrated. Relevant aspects such as ensuring discoverable and accessible data, setting up
reproducible model training, tools for collaboration, hosting and exchange formats are
covered.
[Figure 4.7: traditional system testing and monitoring comprises unit tests, integration tests and system monitoring on the code and running system; ML-based system testing and monitoring additionally comprises data tests, skew tests, model tests, ML infrastructure tests, prediction monitoring and data monitoring across data, model training and the running system]
Figure 4.8: Continuous delivery for ML end-to-end process (model building, model evaluation and experimentation, productionize model, testing, deployment, monitoring and observability, along the three axes code, model and data)
With regard to model serving, three approaches are proposed. In the first option, the model is
embedded into the consuming application and is deployed as an artifact. A second option is
to deploy a model as a separate service where the model wrapped in a service that can be
deployed independently of the consuming applications. Finally, the model can also be
published independently, but the consuming application will ingest it as streaming data in real
time.
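As an illustration of the second option, a model wrapped in a separate service could, for instance, be exposed through a small web service. The following sketch assumes Flask and a pipeline previously saved with joblib; the file name, route and payload format are illustrative assumptions rather than the approach prescribed by the authors.

```python
# Minimal sketch of serving a model as a separate service (option two above).
# Assumes Flask and a pipeline saved with joblib; names are illustrative.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
pipeline = joblib.load("quality_pipeline.joblib")   # hypothetical artifact

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]       # e.g. [[0.1, 0.2, ...]]
    prediction = pipeline.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```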
The necessity of testing data quality, component integration and model performance as
well as experiment tracking is stressed. Complex scenarios for deploying models such as
shadow deployment are mentioned and briefly explained. In addition to the tests before the
deployment step in the process, monitoring and observability includes checking and
interpreting the model’s inputs, outputs and performance.
Putting ML Models in Production by Kervizic (2019)
The author provides an overview of different approaches to deploying ML models in
production, identifying two important considerations in the form of training and serving. Training
can be executed as one-off training, batch training or real-time/online training. Not updating the
model in production is called one-off training. Batch training, in comparison, describes the
process of releasing a refreshed version of the model based on the latest training run, whereas
continuously updating the model is called online or real-time training. Each training scenario
comes with its own advantages and disadvantages. With regard to serving predictions to
systems wanting to consume the information there are batch predictions and real-time
predictions, which differ in the capability of ingesting live input and thus have implications on
cost and complexity of the computation infrastructure. For batch predictions, the predictions
are served through data exchange formats. Real time predictions can be served through a
database trigger, a pub/sub model, a web-service or even in-app. Each approach is suitable
for different situations and requires different technologies for the implementation.
How to deploy ML Models & Monitoring in Production by Samiullah (2019), (2020)
Deployment of ML models is hard as it combines all the challenges of traditional code with an
additional set of machine learning-specific issues. The first step is to derive an adequate ML
system architecture from business requirements and company goals by specifying the
necessity of real time predictions, model update frequency, data characteristics, regulated
environment and the team’s experience. The author, who also appears as a professor in an
online Udemy course called “Deployment of Machine Learning Models”, proposes four
common architecture patterns (see Table 4.4) which each have their pros and cons and must
be selected according to the specific use case.
Table 4.4: Four potential ML system architecture approaches
Pattern 1 (REST API): training by batch, prediction on the fly, prediction result delivery via REST API
Pattern 2 (Shared DB): training by batch, prediction by batch, prediction result delivery through the shared DB
Pattern 3 (Streaming): training by streaming, prediction by streaming, prediction result delivery via message queue
Pattern 4 (Mobile App): training by streaming, prediction on the fly, prediction result delivery via in-process API on mobile
ML systems need to fulfill some key principles such as reproducibility, which consists of
building a reproducible pipeline covering data gathering, data preprocessing, variable selection
and model building, as well as testing, which is a crucial aspect and may be executed in the
form of differential, benchmark and load/stress tests. Tools for containerization, CI/CD, hosting
platforms and emerging frameworks for managing the ML life cycle are addressed just as the
need for monitoring and alerting.
In a subsequent post, the same author focuses on monitoring ML models once they are
deployed. Monitoring in combination with testing is used to understand the spectrum of ML
risk management with the trade-off between level of confidence in the model’s behavior and
the ease of making adjustments. Data science issues occur regarding the data, whereas
operational issues are linked to the system performance. Observability describes the ability
to comprehend what is happening inside of the system. The use of metrics and logs for
monitoring purposes is explained and illustrated through pseudo code, before closing with a
current overview of the constantly changing software landscape.
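In a similar spirit to the pseudo code referenced above, the following sketch indicates how basic operational and data-related monitoring information could be captured around each prediction using standard logging; the function name, thresholds and metric choices are illustrative assumptions rather than the author's prescription.

```python
# Illustrative sketch of prediction-time monitoring via logs and simple metrics.
# Thresholds and the expected feature range are made-up example values.
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_monitoring")

EXPECTED_RANGE = (0.0, 100.0)   # assumed valid range of an input sensor value

def monitored_predict(model, features):
    start = time.perf_counter()
    prediction = model.predict([features])[0]
    latency_ms = (time.perf_counter() - start) * 1000   # operational metric: latency

    # Data-related check: flag inputs outside the expected range (possible drift).
    out_of_range = [x for x in features
                    if not EXPECTED_RANGE[0] <= x <= EXPECTED_RANGE[1]]
    if out_of_range:
        logger.warning("Input values outside expected range: %s", out_of_range)

    logger.info("prediction=%s latency_ms=%.2f", prediction, latency_ms)
    return prediction
```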
How to serve and monitor Models by Akyildiz (2020a), (2020b)
In the blog post by an engineering manager at Facebook, the three most common ways of
serving models are identified. The first pattern is to materialize/compute predictions offline
and serve them through a database. The second architecture consists of using the model within
the main application, so that model serving/deployment can be done with the main application
deployment. The third option is to use the model separately in a microservice architecture to
which inputs are sent and from which outputs are received. Each architecture is
evaluated in detail with regard to criteria such as system set-up effort, maintainability,
scalability and infrastructure complexity, real-time capability, flexibility, and traceability
showing which advantages and disadvantages need to be considered.
Building on serving models, a further blog post by the same author identifies the importance of
monitoring the performance of the ML model, service downtime as well as changes in data
and behavior. The author describes specific aspects that need monitoring and provides
adequate criteria for monitoring such as metrics. In addition, best practices for effective
monitoring are briefly mentioned.
Ultimate Guide to Deploying ML Models by Patruno (2020)
In a series of blog posts, starting with the relevant factors for ML deployment and the
interaction of the end user with the ML model, the author stresses the importance of best
practices. Standardized software interfaces with defined inputs and outputs reduce the
implementation effort. Furthermore, model registries serve to store and track trained ML
models. The next key decision is the selection of the type of inference system. If a batch
inference scheme that precomputes predictions in batch does not fulfill the requirements, real
time predictions can be generated through online inference infrastructure which comes with a
set of challenges. Relevant considerations in the selection process are explained in detail.
Testing is crucial for ML, so the author recommends executing tests for each added function
in the form of test-driven development. Offline testing is done before deployment and focuses
on ML performance metrics. Online validation, also known as experimentation, aims to detect
causality between the deployed ML models and business KPI. At the same time, model
monitoring is used to detect the need for retraining. The extent of testing depends on the
complexity of the application, the business cost of model errors and the resource constraints
of the organization. To illustrate the procedure, each aspect of the guideline is accompanied
by pseudocode of an exemplary use case, and providers of deployment configurations and
tools for ML are referred to. Overall, the process of deploying new model versions is described
as "non-trivial" for reasons such as conflicts between the prototyping team and the deployment
team.
4.4 Theory Deficit
In conclusion, a theory deficit can be derived from the conducted analysis of the state of the art.
Through the evaluation of existing works based on a set of defined criteria, a research gap
concerning the deployment of ML models in production is identified. The existing
approaches, the results of a thorough review of academic and gray literature, do not answer
the research question satisfactorily. Thus, further research on the basis of these findings
is necessary.
5 Outline of the Methodology
In accordance with the defined objective of the thesis and the findings from analyzing not only
the problem in practice but also the state of the art from an investigation-related point of view,
the development of the methodology is subject to preceding considerations and boundary
conditions. These include the requirements which are to be met by the methodology, an
exact definition of the scope, and a contextualization with respect to existing frameworks.
5.1 Requirements
The developed methodology is subject to individual content requirements based on the set
objective of the work. At the same time, generally valid formal requirements for the
development of a methodology apply.
5.1.1 Content Requirements
Answering the research question formulated in chapter 1 while considering the practical
deficit outlined in chapter 2 presupposes that the solution proposal meets some content
requirements. The evaluation criteria used to derive the theoretical deficit in chapter 4 form
the basis for the following content-related requirements:
• Deployment of ML: The methodology focuses on deploying ML models.
• ML for Predictive Quality: The methodology focuses on applications of ML for
predicting the quality in manufacturing processes.
• Strategic Planning: The methodology provides the relevant high-level decisions for a
successful deployment of ML models into production.
• Operational Realization: The methodology presents relevant aspects of the practical
implementation of the deployment process.
• Guideline Structure: The methodology can be used as a guideline with instructional
character which can be followed in order to deploy ML models successfully.
• Transferability: The methodology can be transferred to further use cases
characterized by different requirements.
5.1.2 Formal Requirements
The starting point for defining formal requirements for a methodology is the term itself. Here,
a methodology is described as a “set of methods used in a particular area of study or activity”
(Cambridge Dictionary, 2014). These methods are applied to gain scientific or practical
knowledge with the help of models as representation of real systems. Therefore,
requirements from model and system theory are initially placed on the methodology
(Stachowiak, 1973):
• Representation: The methodology represents the defined observation area.
• Contraction: The methodology simplifies the overall system to the relevant attributes
and elements.
• Pragmatism: The methodology can be applied by a specific user in a target-oriented
way.
In addition, the requirements established by Patzak (1982) in the context of systems
engineering apply:
• Empirical correctness: The methodology is consistent with reality.
• Formal correctness: The methodology is free of contradictions.
• Productivity: The methodology is providing useful answers.
• Manageability: The methodology is easy to apply and interpret.
• Low effort: The methodology’s use is associated with low effort.
5.2 Scope
Due to the huge extent of the topic, clearly defined boundaries of the scope are necessary.
These boundaries concern the programming language, used algorithms, and use case
characteristics.
As a first limitation, only ML models that are developed in the programming language Python
are considered. Python constitutes the quasi-standard for ML, with the main reason for this
being the availability of provided libraries (Subasi, 2020, p. 96). In addition to the libraries
available, there are even more advantages for Python: its clear syntax, easy text
manipulation, its popularity with people and organizations, open-source availability, and
readability through pseudo-code (Harrington, 2012, pp. 13–15). This restriction excludes ML
models written in R or other languages, but it does not mean that the final application must
also be programmed in Python. The applications that provide the model to the user may very
well be built in alternative languages, depending on the device the user chooses for access.
Consequently, the methodology does not exclude languages such as Java or C if they come
into question for the development of the application.
There is an enormous choice of different algorithms for ML models in Python. Therefore, this
thesis focuses on regular ML algorithms and leaves out neural networks, which have their
own individual requirements for the deployment. Deep learning algorithms in combination
with Big Data require much more computing power, resulting in higher cost, a longer
training process and data handling issues due to data size. Furthermore, only those
algorithms are considered which process structured, tabular data and do not receive
unstructured data inputs in form of video, audio or other types of signals. This restriction is
legitimate due to the intended application area which is predictive quality. Sensor data from
production is fed to an ML model that aims to predict the output quality of the process.
The described use case, not only manufacturing in general but predictive quality in particular,
applies to manufacturing companies with high quality standards. If quality is not a
designated business strength, the complex use of ML is not necessarily worth the effort.
Companies for which predictive quality is beneficial have in common that they are
specialized, mostly medium-sized companies with their core competence in production.
Software and IT are seen as a tool to support the production, but not as a core value. Small
companies do not have the manpower in IT or the knowledge of creating and maintaining
complex ML systems. They focus on the fabrication of products and use ML as a tool to
improve the processes, specifically to improve the quality. Consequently, these companies
cannot be compared to big internet firms which use ML as an essential part of their business
model and thus can allocate a considerable amount of resources to the development and
deployment of ML models. For small and medium-sized companies, the factors of cost and
dependency are very important so that non-specialized concepts and tools in form of
open-source solutions are treated primarily. Companies with purely digital business models
may have different requirements and may evaluate concepts and tools differently than those
traditional companies considered here. The choice of tools and concepts has far-reaching
consequences in terms of dependency on third parties and own efforts. Bringing ML to
production must come with a reasonable effort and price. At the same time, they do not want
to be dependent on one specific software solution by one provider only, so open-source
solutions become more relevant. They are free and thus reduce the cost and dependency on
a provider. Measuring and evaluating these efforts financially is not part of the methodology.
Rather, proposed solutions are assessed qualitatively without making concrete quantitative
statements about economic aspects.
5.3 Reference Framework
The methodology for deploying ML models is not developed in isolation as a stand-alone
entity but is inserted into an existing framework, the AutoML pipeline as shown in Figure 5.1.
Figure 5.1: AutoML pipeline in the context of production based on Krauß
Starting with the use case selection, the pipeline provides an overview of the end-to-end
operations needed to enable ML in a production environment and identifies the expertise
needed to execute the processes. It covers the entire ML lifecycle from a macro perspective.
As the first block, data integration treats the process of gathering relevant data from several
different sources. With data residing at many different sources, combining them can prove
itself as a challenging task. In the data preparation block, the dataset is generated. Data
preprocessing operations target the increase in data quality followed by feature engineering,
the pre-computation of features to facilitate the extraction of knowledge by an ML algorithm.
In the modeling block, a model is generated. After choosing an algorithm (algorithm selection),
the algorithm is set up through hyperparameter optimization. In training, the ML algorithm is run
to generate a model. In diagnosis, the focus is on understanding a model’s results from a
domain expert perspective. The modeling steps are executed in several iterations, often
requiring further adjustments in the dataset. Feedback loops may even be necessary between
data preparation and modeling.
The final block is the deployment. The deployment itself is broken down into the
following sub-phases: Deployment design, productionizing & testing, monitoring as well as
retraining. These sub-phases are presented and elaborated in detail in the following chapter.
Based on a successful deployment, certification aspects can be considered.
6 Development of the Methodology
Within the course of this chapter, the results of the development of the methodology are
presented. Figure 6.1 summarizes the core findings for each phase which are subsequently
explained one after another.
Figure 6.1: Overview of methodology
[Figure 6.1 content.
Deployment Design (Chapter 6.1): Prediction Approach (How are predictions made? By batch / in real time), Learning Method (How are models updated? Offline / online), Model Serving (How are models served? Embedded into / separate from the consuming application), Consuming Application (How are predictions consumed? Web app accessible via browser / native app installed on device), Hosting Solution (How is the system hosted? On-premises / cloud).
Productionizing & Testing (Chapter 6.2): Plan (define application requirements), Develop (create application based on requirements), Test (define and execute tests on the application), Release (roll out application to users), Automate (execute steps in an automated manner).
Monitoring (Chapter 6.3): Monitor (Is the system working?), Understand (What is the system doing?), Analyze (How to improve the system?).
Retraining (Chapter 6.4): Trigger (How is the retraining triggered?), Extent (How extensive is the retraining?), Execution (How is the retraining executed?).]
As an own contribution of this thesis, available concepts, either in the form of theoretical
fundamentals or introduced in existing approaches, are analyzed and joined together in order
to form one complete concept covering the whole deployment process from end to end.
Analyzing the state of the art in chapter 4 showed that no existing approach fulfills the
objective of this thesis alone. Consequently, for the development of this methodology,
concepts from different authors and fields of investigation are combined in a structured
procedure which treats all relevant decisions and steps in the phases of the deployment
design, productionizing & testing, monitoring, and retraining. Relevant aspects supporting
these decisions and steps are provided to ensure the instructional character of the developed
methodology.
The deployment design represents a strategic decision which needs to take the use case into
account. Characteristics of predictive quality applications are especially relevant in this
design phase. Subsequent phases of the methodology face more operational issues which
are generally relevant for the deployment of ML models and not specific to a certain kind of
use case. In addition to the four phases, general aspects that overarch the whole deployment
are covered.
6.1 Deployment Design
As an initial task in deployment, called the deployment design, decision owners need to
design the ML system, which then serves as the blueprint for the implementation. The
deployment design is a two-step phase. Firstly, the business needs and restrictions are
translated into technical requirements. Secondly, these technical requirements are used to
define the system architecture. In this way, the most suitable architecture for the given use
case characteristics can be found.
6.1.1 Pre-considerations: Design Requirements
For the technical requirements, there is only a limited number of options. In order to organize
the findings, a morphological box as introduced by Zwicky and Wilson (1967) is applied.
The requirements comprise parameters and the possible values each parameter can
assume. By breaking down the overall problem into attributes, the technique makes it possible
to compress and visually structure the large and disorganized variety of deployment options
and even to create new, previously unseen solutions by combining values. In doing so, the
terminology is harmonized, as different authors use different denominations for similar principles.
Figure 6.2 shows the identified parameters as well as the corresponding technical question
each parameter aims to find an answer for. Possible values for each parameter are also
depicted. By selecting one option for each parameter, the design requirements for the
system architecture are determined. Subsequently, all parameters and the available
solutions are explained focusing on the applicability in the context of production.
Figure 6.2: Morphological box for deployment design (parameters: prediction approach, consuming application, model serving, learning method, and hosting solution, each with two possible values)
Prediction Approach
Predictions can be made by batch or in real time. Batch predictions have a forecast character
as they do not consider real time input. In contrast, real time predictions are calculated at the
exact required moment triggered either by a user request or by the arrival of new data.
The design of the ML system is primarily impacted by the necessity of real time capability
which has implications on the effort and cost associated with the operation. With pre-
calculated predictions by batch, computing can be spread out according to available
capacity. For real time systems, the availability of the service must be ensured during peak
loads including the planning of a failover system. Thus, monitoring and debugging activities
are more complex and time critical resulting in higher cost (Kervizic, 2019).
For predictive quality use cases, the decision for or against a prediction approach depends
mainly on the maximum acceptable waiting time in production. If no real time predictions are needed, a
batch system should be considered as it generates less complexity and requires less
maintenance effort.
Consuming Application
Displaying the predictions of an ML model requires the distinction between web apps and
native apps. For the end user of the predictions, the look and feel of web apps and native
apps might be very similar, but the choice has a great impact on the system architecture. A
web app is an application that is accessible via network by any kind of connected device
without being downloaded onto the device. Native apps, on the other hand, are developed
and installed on a particular device and enable local computation. Web apps have the
advantage of being accessible from multiple different types of devices via browser. Through
a network the predictions are made available for users in different locations. In contrast,
native apps must be installed onto every device, but are optimized for the specific platform
and can run without network connection. Limiting factors for native apps are the required
computing power and the higher development and operation effort in comparison with web
apps. Web apps, in turn, have the disadvantage of not being able to access a device’s built-in features,
as they are developed for cross-platform operation (Bignu, 2019). The decision regarding the
consuming application mainly depends on the type and variety of the used devices in
production.
Running ML models on edge devices such as mobile phones and microcontrollers has
increased in popularity due to a need for on-device data analysis (Konstantinidis, 2020). As
these devices require an application compatible with their host operating system, the
consuming application in case of edge devices corresponds to a native app.
In production environments, both web apps and native apps are used. The option of a web
app should be selected for situations which require access from different devices, easy
usage by non-specialized employees, and universal compatibility. In cases of specialized
devices in production, such as wearable devices or machining tools, native apps are the
logical choice. In addition to displaying the prediction in a web or native app, a notification
about the predicted quality of a product can be sent through an email, a work ticket, or a
visual light on the machine, allowing the responsible agent to initiate remedies.
Model Serving
A key parameter for the architecture is the degree of integration between the calculation and
the consumption of the predictions. One deployment option is to embed the model in the
main application. This includes the possibility of integrating the model into the front end.
Alternatively, a model can be deployed as a separate service. This can be a
webservice in the back end which receives input and returns output. However, the separate service
can also be set up in a streaming manner.
In cases with separate model serving, the predictions need to be delivered to the consuming
application. In accordance with the previously described options, there are mainly three
realizations of result delivery:
• Via database
• Via REST API
• Via streaming
Making predictions available in a database enables a direct database access from the
consuming application. Instead of a database, an API can be used to deliver the results. An
API, short for application programming interface, is not a database or a server but organizes
the access to a webservice (Eising, 2017). REST APIs work according to a request-response
principle, whereas in streaming scenarios a continuous data stream is published to which the
consuming service can subscribe. As an illustrating analogy, a REST API can be compared
to a waiter in a restaurant who takes orders and returns the desired result (Houghton, 2018).
Streaming, on the other hand, can be compared to a newsletter from an online shop. The
consumer subscribes once to a service and then automatically receives the newest data
without having to request it explicitly every time (Björklund, 2017).
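To illustrate the request-response principle from the perspective of the consuming application, the following minimal Python sketch requests a prediction from a separately served model via a REST API; the endpoint URL, the JSON field names, and the feature values are assumptions chosen for illustration only.

    import requests

    # Hypothetical endpoint of the separately served model (assumed for illustration)
    ENDPOINT = "http://ml-service.example.com/predict"

    # Process parameters recorded for one produced part (fictitious feature names)
    payload = {"features": {"temperature": 71.3, "pressure": 2.4, "vibration": 0.08}}

    # Request-response principle: the consuming application asks, the service answers
    response = requests.post(ENDPOINT, json=payload, timeout=5)
    response.raise_for_status()
    print(response.json())  # e.g., {"predicted_quality": "pass"}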
Relevant considerations for the decision regarding an embedded or separate model serving
include the required serving latency and scalability. Calling an external service can increase
waiting time. However, deploying additional ML models to production is easier if models are
served separately and the development and operation of the services is decoupled. A further
crucial aspect is the induced complexity. Streaming model serving requires a complex
system set-up and is only recommended for situations in which a streaming calculation of
predictions in real time is absolutely necessary.
Learning Method
Previous phases, e.g., the selection of an algorithm in the modeling phase, have an influence
on the design of the deployment. As described in chapter 3.2.3, there is online and offline (batch)
learning. The learning method is a relevant parameter for the architectural design as it
defines whether the ability to continuously train a deployed model must be provided. Online learning
implies that all new data points are fed to the model for updating purposes before outputting
a prediction. If the model is learning offline, the training process can be treated separately
from the prediction process.
Similar to the prediction approach, the chosen option has an impact on the system
complexity. Offline training reduces the complexity of the ML system. In contrast, online
learning models are updated as soon as new data is available and therefore require
uninterrupted monitoring of the performance which makes the system more complex to
handle.
Typically, the selection of offline or online learning can be regarded as an input from a
previous phase in the ML life cycle. Already during the modeling, it is decided which kind of
algorithm is used. In big data and production scenarios with a very high volume of data, e.g.,
sensor data from a machine, online learning does not require saving all training data but
allows learning from the incoming data (Hunt, 2017).
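As a minimal sketch of the difference, the following example uses scikit-learn's SGDClassifier, whose partial_fit method allows a deployed model to be updated incrementally with incoming data points instead of being refitted on the complete training history; the two-feature data stream is a fictitious assumption.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Online learning: the model is updated incrementally as new labeled data arrives
    model = SGDClassifier()
    classes = np.array([0, 1])  # 0 = pass, 1 = fail; must be declared for the first update

    # Simulated stream of (sensor readings, quality label) pairs from production
    stream = [(np.array([71.3, 2.4]), 0), (np.array([75.1, 2.9]), 1)]
    for features, label in stream:
        model.partial_fit(features.reshape(1, -1), [label], classes=classes)

    # The continuously updated model is used for the next prediction
    print(model.predict(np.array([[72.0, 2.5]])))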
Hosting Solution
As a final parameter of the deployment design, on-premises and cloud hosting are available
for selection. The responsibility for managing the whole system is distributed between the
organization itself and a cloud provider through a service level agreement (SLA). Different
cloud service levels are distinguished (see Figure 6.3). On-premises solutions are managed
completely within the organization with no external cloud provider involved. When opting for
a cloud option, a provider can supply an instant computing infrastructure known as
Infrastructure-as-a-Service (IaaS), a complete development and deployment environment in
the cloud called Platform-as-a-Service (PaaS), or a ready-to-use software solution which is
referred to as Software-as-a-Service (SaaS). As it is about hosting the ML system, which is
independent from the data hosting, the responsibility for data always lies within the
organization itself (Chen, 2020).
Figure 6.3: Cloud service levels based on Watts and Raza (2019) and Chen (2020)
An analogy illustrates the different service levels. On-premises is the equivalent of owning a
car. IaaS is like a rental car as the hardware is provided by the car rental company with some
responsibility remaining with the renting person, such as refueling. PaaS can be compared to a taxi:
the customer is not involved in its operation but can still decide on the route. Finally, SaaS
solutions are externally managed and can be compared to a bus with a fixed route (Choo,
2018).
On-premises hosting requires sufficient knowledge and resources to operate the company's own servers
and networks. The more responsibility is given to external providers, the less effort is involved
inside the company to manage hardware and software. A trade-off between ease of use,
cost, potential dependency on an external supplier, and data privacy is required.
Especially the issue of data security poses a main challenge in manufacturing use cases.
Production data is among the most sensitive information of a company, so its security
must be a top priority, which favors on-premises solutions.
6.1.2 Architecture Patterns
Based on the selection of options by means of the morphological box, the system
architecture can be designed. Figure 6.4 shows common architectures in practice. In each
pattern, one option for each of the parameters introduced in chapter 6.1.1 is selected. The
figure also indicates the data flow from the data sources to the consumer of the predictions;
data can be either pushed or pulled to the next element.
[Content of Figure 6.3: for each hosting option (on-premises, IaaS, PaaS, SaaS), the layers data, applications, runtime, middleware, O/S, virtualization, servers, storage, and networking are marked as either self-managed or provider-supplied; the share of provider-supplied layers increases from on-premises towards SaaS.]
Figure 6.4: Common architecture patterns in practice
Data sources are placed in the figure as a generic block with no further specification as the
patterns focus on handling the ML model and not the data. Furthermore, the figure does not
show an exhaustive list. There may be additional but less common patterns as well as
individual architectures through the variation and combination of existing patterns. In the
following, the most common styles are described in more detail.
Shared Database
The first architecture is the one with the lowest complexity. The ML model is handled in a
Python script, which brings the model into the correct format and provides a function to
calculate predictions. These predictions are then saved into a shared database, hence the
name of the pattern. Relevant users can easily access the predictions from a web app or
other applications (Akyildiz, 2020b; Samiullah, 2019).
By delivering the prediction results via a database, which can be an already existing one in
the organization, the complexity of the overall system is kept at low levels. However, the
execution of the script is not triggered in real time by the end user. Rather, it is executed
according to a defined schedule, either manually or automatically through a job scheduler. Due to the lack
of real time capability, the shared database pattern mainly serves as a proof of concept. This
means that it is a good way for an initial deployment to bring results into production. But for
situations requiring good scalability and predictions in real time, it is not the preferred choice
(Samiullah, 2019).
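A minimal sketch of such a script, assuming a serialized scikit-learn model, a CSV export of the latest production data, and an SQLite database standing in for the shared database (all file, table, and column names are illustrative), could look as follows.

    import sqlite3

    import joblib
    import pandas as pd

    # Load the serialized model (assumed file name)
    model = joblib.load("quality_model.joblib")

    # Gather the latest production data; the file is assumed to contain exactly the model's input features
    data = pd.read_csv("latest_production_data.csv")

    # Calculate the predictions by batch
    data["predicted_quality"] = model.predict(data)

    # Save the results into the shared database for the consuming applications
    with sqlite3.connect("shared_predictions.db") as conn:
        data.to_sql("quality_predictions", conn, if_exists="append", index=False)

In practice, such a script would be started by a job scheduler, e.g., cron, according to the defined schedule.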
[Content of Figure 6.4: five common architecture patterns, each combining one value per design parameter and indicating whether data is pushed or pulled between elements]
• Shared Database: prediction by batch; web app; model separate (result delivery via database); offline learning; on-premises hosting; data flows from the data sources into a script containing the ML model and from there into the shared database accessed by the web app.
• In-database: prediction in real time; web app; model separate (result delivery via database); offline learning; cloud hosting; the ML model runs inside the database.
• In-app: prediction in real time; native app; model embedded (result delivery within the app); offline learning; on-premises hosting.
• Webservice: prediction in real time; web app; model separate (result delivery via REST API); offline learning; cloud hosting.
• Streaming: prediction in real time; web app; model separate (result delivery via streaming); online learning; cloud hosting; several model versions (v1, v2) run on the streaming platform.
In-database
By integrating the ML model directly into a database, the complexity increases but
predictions can be made in real time. Data from data sources is saved into the database and
the prediction is directly made. Thus, the person using the prediction can access the
database and retrieve the necessary data. As a limitation of this pattern, only databases with
ML capability can be used and the realization is highly dependent on the provider (Kervizic,
2019).
In-app
A different possibility for designing the architecture is to embed the ML model into a native
app. This pattern is typically used when running the computation on edge devices, which falls
into the category of on-premises hosting. Calculating predictions in-app on a mobile device
has the advantage of not needing any external connection which increases data security.
However, it comes with limitations such as the choice of frameworks for the specific device
and the computing power of the device. Data is not sent to a separate service for prediction
purposes, so that the device itself needs sufficient computation capability (Sato et al., 2019).
A concrete realization of this pattern is the integration of an ML model into the control
software of machine tools if the machine’s manufacturer allows interfering with its software.
As a model update requires installing a new app version on all consuming devices, the
scalability is poor (Kervizic, 2019).
Webservice
A common pattern in practice is to wrap the model in a webservice and deploy it as a
separate service (Akyildiz, 2020b; Pinhasi, 2020). The communication between the web app
and the webservice works in form of a REST API. A good scalability is achieved by using
existing approaches for webservices which are designed for handling high traffic through
measures such as load balancers. Moreover, cloud hosting of the service allows access
from many different devices and locations. The system management difficulty is medium
combined with a good scalability, which makes this pattern the best trade-off
between complexity and performance for many situations (Samiullah, 2019).
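For illustration, the following minimal sketch wraps a serialized model in a small webservice using Flask (one possible framework) with a single REST endpoint; the route, the input format, and the model file name are assumptions, and production concerns such as load balancing, authentication, and input validation are deliberately omitted.

    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("quality_model.joblib")  # serialized pipeline (assumed file name)

    @app.route("/predict", methods=["POST"])
    def predict():
        # The consuming web app sends the numeric feature values of one part as JSON
        features = request.get_json()["features"]
        prediction = model.predict([features])[0]
        return jsonify({"predicted_quality": int(prediction)})

    if __name__ == "__main__":
        app.run(port=5000)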
Streaming
A streaming architecture is also characterized by ML model serving separate from the
consumer, but it follows the push principle. Data streams from production enter the
streaming platform, which then allows training and predicting in real time based on the incoming
data, and the prediction is pushed to the consuming application (Kervizic, 2019; Sato et al.,
2019). As it is hosted separately, good scalability is given even for high volumes of data.
The pattern’s biggest disadvantage is the very high complexity (Samiullah, 2019). Setting up
and running a streaming architecture requires a high level of maturity and effort. Thus, for a
given use case it must be analyzed in detail if the advantages outweigh the disadvantages.
Table 6.1 summarizes the presented patterns by evaluating them regarding scalability and
complexity. As described before, a webservice architecture allows good scalability
combined with medium complexity.
Table 6.1: Evaluation of architectures
              Shared Database   In-database   In-app   Webservice   Streaming
Scalability   Medium            Medium        Poor     Good         Good
Complexity    Low               Medium        Medium   Medium       High
6.2 Productionizing & Testing
In the deployment design, a target architecture for deploying an ML model into production is
defined, which then is to be implemented. Productionizing is understood as a series of
implementation tasks in order to bring a model from a research to a production environment
(Wheeler, 2019). During this transfer, testing represents a crucial aspect (Breck et al., 2017).
Before diving into the implementation steps, pre-considerations about the involved
environments are presented.
6.2.1 Pre-considerations: Environments
An ML model is developed in a research environment and deployed to a production
environment. Both environments are very different from each other. In the research
environment, the model is handled in a notebook by the data scientist. It is separate from
customer-facing software, so that experiments can be easily run. In contrast, the production
environment is live and accessible for the customer. Issues regarding scalability,
reproducibility and infrastructure planning must be considered (Galli & Samiullah, 2021).
Figure 6.5 illustrates the transition from the ML model development in a research
environment to the software development of the application including the ML model. This
application is not directly deployed to production but goes through the typical four tiers of
environments in software development: development, testing, staging and production
(Murray, 2006). The application is developed, then tested with respect to the integration with
other components, released to a pre-production environment awaiting approval before finally
being deployed to a live production environment.
Figure 6.5: Environments for ML model development and ML software development
6.2.2 Implementation Steps
Following the same steps as regular software development, the implementation process in
Figure 6.6 shares commonalities with the DevOps cycle introduced in chapter 3.3. First,
requirements are defined in the planning phase. Then, the application is developed based on
said requirements. Subsequently, tests are defined and executed in order to ensure that the
application is working as planned. Finally, the program is released and rolled out to the user.
These consecutive steps can be automated.
Figure 6.6: Sequence of implementation steps
In comparison to traditional software, the process for deploying ML software is even more
complex. Whereas regular software is subject to changes in the code, ML software needs to
consider the changes in the data and the model additionally (Sato et al., 2019).
Compared to the deployment design with its limited number of parameters,
productionizing & testing is characterized by being a complex process with an unlimited number
of options. A key factor for a successful deployment process is the application of best
practices from software engineering (Serban et al., 2020). Thus, for each step the most
relevant aspects are listed with a focus on ML-specific aspects.
6.2.2.1 Plan
In the plan phase, requirements for the development process are defined. Planning
comprises tasks which are common for any kind of software development project as well as
factors which are only needed for the deployment of ML.
[Content of Figure 6.5: the ML model is developed in the research environment; the resulting ML software then passes through the development, testing, staging, and production environments.]
[Content of Figure 6.6: plan (define application requirements) → develop (create application based on requirements) → test (define and execute tests on the application) → release (roll out application to users); automate (execute steps in an automated manner).]
Project Management
The scope of the development project depends on the specifications of the ML system
defined in deployment design (chapter 6.1). Among further decisions, it primarily includes
specifying the number and type of applications to be developed in accordance with the
selected architecture pattern and the existing IT landscape in the organization.
For managing the project during its execution, frameworks such as Scrum (chapter 3.3) can
be applied. Activities for project management do not belong to ML-specific tasks and, thus,
are not covered in more detail at this point.
ML Functionality
From an ML perspective, the required functionality of the application is to be specified. It
should be defined in an early stage of the project life cycle, either during the business
understanding in CRISP-DM (chapter 3.2.2) or as an input of the stakeholders before the
deployment.
The most basic functionality which must be given in production is the possibility of generating
predictions. Further options include the evaluation of models, which is covered in the
monitoring phase of the deployment (chapter 6.3), or the capability of creating new updated
models, which is described in the retraining phase in chapter 6.4.
For prediction-making and also retraining, data cannot be used as it is but requires
preprocessing as described in the data preparation of the CRISP-DM (chapter 3.2.2). For the
development, it is to be specified which data preparation steps are executed by the ML
application. In many cases, the steps in training do not coincide with the steps for predictions
so that a detailed definition of data preparation functionality is indispensable.
Continuous Data Integration
The ML system relies on the input of data provided by other systems leading to so-called
data dependencies (Sculley et al., 2015). In order to address the data dependencies, it is
necessary to clearly define how the data, which the model needs for making predictions
during the serving phase, is continuously integrated. In production settings, data is typically
saved in databases due to their ease of use and flexibility. These databases reside either on-
premises or in the cloud. Alternatively, data sources can be data streams directly from the
machine or a centralized data warehouse. The goal of this planning task is to ensure that the
ML application is linked correctly to the existing data infrastructure. Therefore, not only do the
possible data sources need to be defined, but also the data input format, e.g., a file or an SQL
query, needs to be specified.
Multiple ML Models
Often not a single model is deployed but multiple models, in the form of duplicate models,
specialized models, stacked models, cascaded models, or competing models (Sato et al.,
2019). Duplicate models are multiple models performing the same task, which makes it possible to
distribute requests between models if the response time of one algorithm is long.
Specialized models each have a different purpose, e.g., one for each product. Stacked
models combine several algorithms into one more powerful predictive model. For
cascaded models, the traffic is routed to an alternative model if the baseline model produces
a prediction with low confidence, or a baseline model makes a first prediction and, based on
it, forwards the task to a specialized model for so-called refinement. Competing
models work through the allocation of data traffic across several competing models to make
the best prediction. In all these cases, incoming data must be guided through the system to
the different models, which may even have different data preparation steps.
6.2.2.2 Develop
Based on the defined requirements, the application is developed. Again, the focus is on
aspects which are especially relevant in the context of ML deployment. General
considerations for any software development project such as best practices are not covered
in detail in this work.
Code Tracking
For version control and collaboration, code changes must be tracked. The two main options
for code tracking are depicted in Figure 6.7. In the GitHub workflow, new features are
developed in separate branches and then merged into the master branch, which is in
production. Alternatively, the GitLab workflow separates the master from the production
branch. This separation is suitable for situations in which it is not possible to deploy every
time a feature branch is merged, e.g., for fixed deployment time windows. In comparison, the
GitHub flow is simple, clean and straightforward and, thus, more suitable for less complex
scenarios. In any case, the selected workflow needs to be aligned with the set-up of
environments and can deviate from the two presented ones.
Figure 6.7: GitHub vs GitLab workflow (GitLab, 2021)
Data and ML Model Versioning
Not only the code is to be tracked but also the data and the ML model. For this purpose,
models are versioned to allow comparability between different versions. At the same time,
each model version is linked to the respective training data in order to trace the data used for
the training of each model. Data preprocessing steps, as part of the data or the model,
require tracking as well. Data and ML model versioning ensure that decisions which were
based on a model’s prediction can be reproduced. Challenges in this context are the high
volume of data to store and a clear definition of how models are versioned (Amershi et al.,
2019).
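As a lightweight sketch of this idea, assuming that no dedicated tool such as DVC or MLflow is used, a model version can be stored together with a small metadata file that records a hash of the training data; the directory and file names are illustrative.

    import hashlib
    import json
    import os
    from datetime import datetime

    import joblib

    def save_versioned_model(model, training_data_path, model_dir="models"):
        """Serialize a model together with metadata tracing the used training data."""
        os.makedirs(model_dir, exist_ok=True)

        # Fingerprint of the exact training data set used for this model version
        with open(training_data_path, "rb") as f:
            data_hash = hashlib.sha256(f.read()).hexdigest()

        version = datetime.now().strftime("%Y%m%d_%H%M%S")
        joblib.dump(model, os.path.join(model_dir, f"model_{version}.joblib"))

        metadata = {"version": version, "training_data": training_data_path, "data_sha256": data_hash}
        with open(os.path.join(model_dir, f"model_{version}.json"), "w") as f:
            json.dump(metadata, f, indent=2)
        return version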
ML Code Structure
Code enables the defined ML functionalities with its structure being dependent on the set-up
of the data preprocessing and possible multiple models. The regular way of structuring the
ML code is procedural programming, which can be seen on the left side of Figure 6.8.
Functions for data handling and calling the algorithm are written separately and are called
one after another. This has the advantage that the code from the notebook in the research
environment, typically a Jupyter Notebook, can be adopted to a large extent. As a downside,
all functions must be debugged separately, which increases the effort. Alternatively, the data
preparation steps and the final algorithm can be joined into one exportable pipeline object. In
comparison to procedural programming, pipelines have a pre-defined structure that must be
complied with. Consequently, the effort for transforming the notebook code into a pipeline
object, which can be custom or provided by third parties (e.g., scikit-learn), is high if pipelines
are not introduced already in the research environments (Galli & Samiullah, 2021).
Figure 6.8: Procedural programming vs pipeline structure
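As an illustration of the pipeline structure, the following sketch joins exemplary data preparation steps and an algorithm into one exportable scikit-learn pipeline object; the chosen preprocessing steps and the algorithm are assumptions and would be replaced by the steps defined in the research environment.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Data preparation steps and the algorithm are joined into one exportable object
    pipeline = Pipeline(steps=[
        ("impute_missing_values", SimpleImputer(strategy="median")),
        ("scale_features", StandardScaler()),
        ("algorithm", RandomForestClassifier(n_estimators=100, random_state=42)),
    ])

    # The whole pipeline is trained and used like a single model:
    # pipeline.fit(X_train, y_train)
    # predictions = pipeline.predict(X_new)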
ML Model Serialization Format
In order to be used as an object which can be integrated into an application, a model must be
serialized. In other words, the pipeline object or, respectively, the trained algorithm is
transformed into one file, which facilitates versioning. Although the Python standard format is
pickle, there are many more serialization formats. Specialized formats are available for other
types of algorithms or frameworks (Dowling, 2019). In exceptional cases, no serialization is
needed, e.g., for unsupervised learning algorithms that are run through a script.
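A minimal sketch of serialization with the Python standard format pickle could look as follows; the file name is an assumption, and the logistic regression merely stands in for the trained pipeline object.

    import pickle

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()  # stands in for the trained pipeline object (left untrained here for brevity)

    # Transform the trained object into a single file (facilitates versioning)
    with open("quality_model_v1.pkl", "wb") as f:
        pickle.dump(model, f)

    # In the production application, the file is deserialized back into an identical object
    with open("quality_model_v1.pkl", "rb") as f:
        restored_model = pickle.load(f)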
Build Format
Not only the model but the whole application around the model must be made executable
and runnable. With the exception of directly runnable Python scripts, the build format of the
program needs to be specified. Mainly, there are two options to take into account in the form
of packages and containers. Packages are bundles of files that are written for a target
operating system and need to be installed through a package manager to run an application,
mostly on virtual machines. Containers are isolated sandbox environments that contain all
necessary resources to run an application and share the kernel with other applications.
Multiple containers can be managed through an orchestration service (Fagerberg, 2015).
In Figure 6.9 the differences between the options are illustrated. For bare metal, an
application is installed directly on the host operating system. Virtual machines are used to
create multiple separate units all based on the same hardware but with an own guest
operating system (OS). As containers share the same operating system with other containers
and do not have their own guest OS, they are more lightweight but also limited to the host
operating system.
The main advantage of containers is that the application is not installed but already contains
all necessary information to be run. Thus, it is ensured that it is executed correctly on any
host. Due to this strength, the use of containers is increasing in popularity. Especially in
situations with no direct access to the production server, applications can be developed and
tested remotely as containers and then transferred to the deployment infrastructure.
Figure 6.9: Bare metal, virtual machines, and containers based on Kominos et al. (2017)
6.2.2.3 Test
The developed application is subject to thorough testing (Breck et al., 2017). As shown in
Figure 6.10, tests on different levels can be executed regarding the introduced dimensions of
code, model and data.
[Content of Figure 6.9: bare metal runs apps directly on the host operating system; virtual machines each add a guest OS and their own binaries/libraries on top of a hypervisor; containers share the host operating system and run on a container runtime with their own binaries/libraries.]
Figure 6.10: Testing pyramid
Unit tests as the fundamental base for the testing pyramid are used to test components
during the development. One level up, there are integration tests which ensure that multiple
components work together as required. On the top of the pyramid, end-to-end tests validate
the whole application through real user scenarios. For each level, there are many different
tests available out of which Table 6.2 shows the most important ones with respect to the ML
deployment. The main challenge at this point is reproducibility between the research
environment and the production environment (Galli & Samiullah, 2021).
Table 6.2: Tests according to Sato et al. (2019)
Type of test   Artifacts               Test description
Unit           Data                    Data test: Validate data against schema or distributions
Integration    Code and Model          Contract test: Validate that the expected model interface is compatible with the consuming application
Integration    Model and Data          Model quality test: Evaluate model performance through metrics against a performance baseline
                                       Model bias and fairness test: Check performance across different slices of the data
Integration    Code, Model and Data    Consistency test: Validate that the exported model produces the same results as the original one against a validation data set
End-to-end     Code, Model and Data    End-to-end test: Validate the whole application
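As a minimal sketch of a data test on the unit level, the following pytest function validates incoming production data against a simple schema; the column names, the file name, and the value ranges are assumptions for illustration.

    import pandas as pd

    EXPECTED_COLUMNS = ["temperature", "pressure", "vibration"]  # assumed feature schema

    def test_data_matches_schema():
        """Data test: validate incoming data against the expected schema and value ranges."""
        data = pd.read_csv("latest_production_data.csv")  # assumed input file

        # All expected feature columns must be present
        assert set(EXPECTED_COLUMNS).issubset(data.columns)

        # No missing values in the feature columns
        assert not data[EXPECTED_COLUMNS].isnull().any().any()

        # Plausibility check on value ranges (illustrative threshold)
        assert data["temperature"].between(0, 200).all()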
6.2.2.4 Release
When it comes to releasing a tested application and making it accessible in live production, there are
different roll-out strategies available which are generally valid for conventional and ML
software. The roll-out strategy specifies how a live version of the application is substituted
with a newer one. In this scenario, version A is currently active and shall be replaced by the
updated version B. The following strategies can be distinguished (Posta, 2015; Tremel,
2017):
• Recreate: Version A is terminated then version B is rolled out.
• Ramped (also known as rolling-update or incremental): Version B is slowly rolled out,
replacing version A.
• Blue/Green: Version B is released alongside version A, then the traffic is switched to
version B.
• Canary: Version B is released to a subset of users, then rolled out fully.
• A/B testing (not for software release but to test features of the application): Version B
is released to a subset of users under specific conditions.
• Shadow: Version B receives real-world traffic alongside version A and does not
impact the response.
The selected strategy is to be aligned with the number of models and the environment set-up
as for all strategies, except the ramped roll-out, the new version is deployed alongside the
old one.
6.2.2.5 Automate
The automation of the previously described steps promises a gain in efficiency and
deployment speed. Before automating, a manual execution is necessary in order to gain a
deep understanding of the whole process. Automated deployment achieves a reduced
possibility of errors, time savings, consistency, and repeatability (Simek & Slomkova, 2021).
As illustrated in Figure 6.11, different levels of automation can be found that were developed for
conventional software but are equally applicable to ML software (Sato et al., 2019).
Figure 6.11: Degrees of Automation based on Chigira (2019)
6.3 Monitoring
Once a model is released to production, monitoring is a key consideration for ensuring
production-readiness of an ML system (Breck et al., 2017). In the software development
(chapter 3.3), monitoring serves as the last DevOps task, closing the cycle. Like
productionizing & testing, monitoring combines traditional software development with
ML-specific aspects.
6.3.1 Pre-considerations: ML Model Decay
From an ML perspective, all models have in common that they deteriorate over time, with only the
speed of the decay varying (Samuylova, 2020). Models in stable environments may achieve
a constantly high quality over a long period of time; in other cases, the quality decreases
quickly. In any case, the following phenomena cause the ML model decay in the first place.
Data Drift
The data drift describes a change in data distributions (Samuylova, 2020). A shift in the
distribution in the input variables is called covariate shift, whereas a shift in the predicted
output, e.g., the predicted class, is captured under the term of prior probability shift (Stewart,
2019).
There are two scenarios in which drift can occur (Saha & Bose, 2021). Either data
distributions are compared between two different points in time after deployment or between
training and production data. The possible mismatch between data used for training and data
from live production is referred to as training-serving skew (Samuylova, 2020).
[Content of Figure 6.11: in continuous integration (CI), the steps develop, test, push to pre-production stage, and end-to-end test are automated; continuous delivery (CD) adds a manual release to production; in continuous deployment, the release to production is automated as well.]
Concept Drift
The fact that relationships between the model inputs and outputs can change is called
concept drift (Samuylova, 2020). Even if the data distributions remain the same, the model
may not describe the real world as well as before. The change in relationship can be gradual,
sudden, or even seasonal. As an example of gradual concept drift from manufacturing, the
mechanical wear of equipment causes slightly different results under the same process
parameters (Samuylova, 2020).
6.3.2 Monitoring Levels
Monitoring is needed to detect the aforementioned phenomena. Based on Waterworth
(2019), there are three layers to the problem, as shown in Figure 6.12. Starting from the
bottom of the pyramid, it is necessary to understand what a system is doing before being
able to monitor the system and ensure that it is working as planned. Monitoring itself does
not create any value but always requires an analysis of how to improve the system.
Figure 6.12: Levels of monitoring
6.3.2.1 Understand
The goal is observability, which means making the system's behavior observable. There are three
ways of achieving this goal, called the pillars of observability (Sridharan, 2018):
• Metrics
• Logs
• Traces
Metrics
Metrics are a numeric representation of data measured over intervals of time (Sridharan,
2018). Saha and Bose (2021) build a model monitoring metrics stack with three different
types of metrics. Firstly, there are operational metrics for identifying ML system health issues
including latency, memory and CPU usage as well as system uptime. Operational metrics
are independent of both the underlying data and the ML model. The second type of metrics
are performance metrics which only depend on the ML model by measuring its performance
over time. ML-specific metrics are applied which comply with the respective learning task.
Whereas performance metrics make it possible to identify a concept drift, stability metrics as the third
component of the metrics stack aim to detect data drifts. In doing so, stability metrics, e.g.,
the Population Stability Index and the Characteristic Stability Index, depend on both the underlying data
and the ML model.
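As an illustration of a stability metric, the following sketch computes the Population Stability Index between the training distribution and the current production distribution of one feature; the decile-based binning and the interpretation thresholds in the comment are common simplifying assumptions rather than fixed rules.

    import numpy as np

    def population_stability_index(expected, actual, n_bins=10):
        """Compare two samples of one feature; larger values indicate stronger data drift."""
        # Bin edges are derived from the expected (training) distribution
        edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf

        expected_share = np.histogram(expected, bins=edges)[0] / len(expected)
        actual_share = np.histogram(actual, bins=edges)[0] / len(actual)

        # Avoid taking the logarithm of zero for empty bins
        expected_share = np.clip(expected_share, 1e-6, None)
        actual_share = np.clip(actual_share, 1e-6, None)

        return float(np.sum((actual_share - expected_share) * np.log(actual_share / expected_share)))

    # Common rule of thumb (assumption): below 0.1 stable, 0.1 to 0.25 moderate drift, above 0.25 strong drift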
Logs
An event log (respectively data log) is an immutable, timestamped record of discrete events that
happened over time (Sridharan, 2018). Logs are used to capture events like user access or
errors as well as data which was given to the model as prediction input. From both an
operational and ML-specific view, standardized logging messages facilitate the monitoring
process.
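A minimal sketch of a standardized, timestamped log record for each prediction event could look as follows; the chosen fields are assumptions and would be extended by the relevant identifiers and process parameters of the specific use case.

    import json
    import logging
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("ml_service")

    def log_prediction(model_version, features, prediction):
        """Write one immutable, timestamped record of a prediction event."""
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": "prediction",
            "model_version": model_version,
            "features": features,
            "prediction": prediction,
        }
        logger.info(json.dumps(record))

    log_prediction("20240101_120000", {"temperature": 71.3, "pressure": 2.4}, "pass")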
Traces
A trace is a representation of a series of causally related distributed events that encode the
end-to-end request flow through a distributed system (Sridharan, 2018). It makes it possible to follow a
signal through the whole system, including all services involved in the request, and to
understand where issues may arise.
Explainability goes one step further than observability and makes the decisions not only
observable but humanly interpretable by the end user (Bhatt et al., 2020). In manufacturing
processes, explainability can increase the trustworthiness of predictions if the model is not
seen as a black box (Goldman et al., 2021). In the AutoML pipeline (Figure 5.1), certification
of the ML model is the very last step. Explainability is one key element for certification, but
due to the complexity of the topic it is not further elaborated here.
6.3.2.2 Monitor
Based on the understanding of the system, the monitoring itself can take place. Two main
approaches for monitoring with diverging purposes exist.
Dashboards
On the one hand, dashboards offer an overview of a system’s state by providing multiple
metrics (Newman, 2016). This high-level overview focuses on metrics, as they aggregate
information, but can also include logs and traces.
Alerts
Alerts, on the other hand, notify a specified recipient of critical conditions of the system
based on certain pre-defined thresholds (Newman, 2016). Thresholds are based on metrics,
but notifications can also be sent in connection with logs and traces.
The way dashboards and alerts are set up depends on the character of the application.
For containerized applications, existing and standardized monitoring solutions are the preferred
choice. Nonetheless, monitoring functionalities can also be integrated into the main application,
which results in additional requirements in the planning phase of the developed application.
6.3.2.3 Analyze
Once a problem is detected, a root cause analysis as depicted in Figure 6.13 makes it possible to
identify whether the problem is of an operational nature. If an operational problem occurred, a step
back to productionizing & testing is made. If the problem is not of a technical nature but caused
by a decrease in the ML model’s performance, the last step of the methodology, the
retraining, is realized.
Figure 6.13: Analysis flow chart
For the purpose of a root cause analysis, all available information in the form of metrics, logs, and
traces is used. Technical problems or operational performance issues are identified through
error logs. Standardized logging messages facilitate the analysis. As this topic is also very
relevant for traditional software, approaches for automated and efficient debugging are to be
considered.
Regarding the ML models, the two described causes for ML model decay, data drift and
concept drift, need to be analyzed using logged data from real-life production. Statistical
testing allows to identify data drift and outliers (Ackerman et al., 2020). Similarly,
sophisticated methods can be applied for the detection of concept drifts (Nishida &
Yamauchi, 2007). As with debugging, a standardized and automated procedure for the
analysis should be strived for.
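For illustration, one possible statistical test is a two-sample Kolmogorov-Smirnov test comparing the training distribution of a feature with the values logged in live production; the generated samples below are fictitious and only stand in for real logged data.

    import numpy as np
    from scipy.stats import ks_2samp

    # Fictitious samples of one input feature at training time and in live production
    training_values = np.random.normal(loc=70, scale=2, size=1000)
    production_values = np.random.normal(loc=73, scale=2, size=1000)

    statistic, p_value = ks_2samp(training_values, production_values)

    # A small p-value indicates that the two distributions differ, i.e., a possible data drift
    if p_value < 0.01:
        print(f"Data drift suspected (KS statistic = {statistic:.3f}, p = {p_value:.4f})")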
6.4 Retraining
In the monitoring step preceding the retraining, it is detected if and when a productionized
model needs to be retrained. Furthermore, the root cause of decreasing model performance is
identified, which serves as an important input for the retraining, as the retraining comprises actions
based on this analysis.
6.4.1 Pre-considerations: Retraining Effect
As a remedy to the unavoidable degradation of model performance over time, models need
to be refreshed. Figure 6.14 shows the decrease of the quality of static models which are not
retrained. Only through retraining can a constantly high model quality be achieved.
Figure 6.14: Impact of refreshing on model quality based on Thomas and Mewald (2019)
6.4.2 Retraining Decisions
Figure 6.15 illustrates the relevant decisions made regarding the retraining. These
interconnected decisions refer to the trigger, extent and execution of retraining (Patruno,
2019).
Figure 6.15: Retraining decisions
Trigger of Retraining
It is to be determined how the retraining is triggered. There are two main approaches
(Patruno, 2019). One option is to retrain a model based on alerts and the subsequent
analysis in the monitoring phase. Alternatively, the moment of retraining follows a fixed
schedule. This periodic retraining is used for recurring events or strong seasonal influences.
As seen in use cases in chapter 3.1.2, production highly depends on seasons due to
changes in parameters such as temperature and humidity.
Online learning models represent a special case as they are retrained continuously. The
decision between online and offline learning is made during deployment design in chapter
6.1.1.
[Content of Figure 6.14: model quality over time decreases for static models, whereas refreshed models maintain a constantly high model quality.]
Extent of Retraining
Primarily, retraining refers to re-building an ML model on a new training data set without
making any changes to the model itself (Patruno, 2019). The pipeline containing the data
preparation steps and an ML algorithm with its hyperparameters stays the same. In the
special case of online learning models, the algorithm is trained with each new data point,
which likewise does not involve changes to the pipeline.
Depending on the conducted root cause analysis, it may be necessary to tune the model rather
than merely feeding the latest data into the existing one (Samuylova, 2020). These adjustments include
changes to features or the selected algorithm and go beyond ingesting new data. Therefore,
this can be referred to as remodeling rather than retraining.
In order to ensure that the new model is improved with respect to the detected data drift or
concept drift, the performance evaluation of the respective set-up is crucial. The selection of
adequate measures is described in chapter 3.2.4.
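A minimal sketch of retraining without remodeling, assuming an existing pipeline object, a new labeled data set, and the F1 score as the previously selected performance measure, re-fits the unchanged pipeline and only promotes the candidate if it does not perform worse than the currently deployed model.

    from sklearn.base import clone
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    def retrain(current_pipeline, X_new, y_new, current_score):
        """Re-build the model on new data without changing the pipeline itself."""
        X_train, X_val, y_train, y_val = train_test_split(
            X_new, y_new, test_size=0.2, random_state=42
        )

        # Data preparation steps and hyperparameters stay the same; only the data changes
        candidate = clone(current_pipeline).fit(X_train, y_train)
        candidate_score = f1_score(y_val, candidate.predict(X_val))

        # Promote the candidate only if it does not fall behind the deployed model
        if candidate_score >= current_score:
            return candidate, candidate_score
        return current_pipeline, current_score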
Execution of Retraining
For executing the retraining, there are two contrasting approaches. In the manual case, activities
for retraining models are executed manually by, e.g., a data scientist. An automated
retraining is especially beneficial if the monitoring is also set up in an automated manner
(Patruno, 2019). AutoML libraries aim to build the whole pipeline from data preparation to
hyperparameter tuning automatically without supervision. AutoML can also be used for
retraining (Kavikondala et al., 2019). Currently, available AutoML tools are not yet mature
and performant enough to fulfill the task satisfactorily (Krauß et al., 2020).
Retraining is the last phase of deployment, but that does not mean the deployment ends at
this point. A retrained model is productionized & tested again followed by monitoring and
ultimately another retraining process.
6.5 General Aspects for Deployment
In parallel to the four deployment phases (deployment design, productionizing & testing,
monitoring, and retraining), there are overarching concepts which represent important factors
for all of the mentioned phases. Specifically, roles and competencies as well as tools and
frameworks are covered in the following.
6.5.1 Roles and Competencies
As presented in chapter 2, a key factor for failed deployments is the coordination between
different stakeholders. Figure 6.16 shows the three involved types of expertise that in
combination allow a successful deployment. There is process (respectively business)
competence in the form of domain knowledge, data science competence, and DevOps
competence (Samiullah, 2019).
Figure 6.16: Collaboration between process, data science and DevOps competence
Data science and DevOps can be analyzed together as it is possible to integrate the two
competence fields into one. Figure 6.17 shows dimensions of data science and DevOps
competencies. These dimensions can be used to evaluate the maturity of an organization.
This maturity evaluation is to be executed beforehand, as a pre-consideration for the deployment, in
the form of a business and competence analysis. Applying the technique of a maturity model is
one way of assessing an enterprise and understanding its current and target states.
Figure 6.17: Maturity model dimensions based on Hornick (2018): roles, data awareness, methodology, strategy, data access, asset management, tools, and scalability
The underlying key to success for deployment is the collaboration between roles and
responsibilities in the same organization, especially between data science and DevOps. Data
science responsibilities comprise all steps of the CRISP-DM from business understanding to
evaluation. For deployment, the information is passed to a DevOps team from a software
engineering background which industrializes the data science project by recoding in another
language, model evaluation and testing, scheduling, monitoring features and deployment
itself (Gherman, 2020).
In order to address the specific challenges of deployment, a new specialized role in form of
an ML engineer has emerged, which is placed between software engineers and data
scientists. Small companies do not have the resources to employ a data science and
DevOps team but rather require one role covering the whole range from data science to
software engineering (Odegua, 2020).
Involved parties during the ML life cycle including the deployment and the model
maintenance can be managed with the tool of a RACI matrix. Stakeholders are classified as
responsible, accountable, consulted or informed. By means of the matrix, the roles existing in
the specific company are clearly distinguished to enable a successful execution of the ML
project ending with the deployment (Wehrstein, 2020).
6.5.2 Tools and Frameworks
When talking about tools and frameworks, a key factor is the decision between open-source
and closed-source solutions. Table 6.3 shows the respective advantages and disadvantages.
Generally, the pros of open-source are the cons of closed-source and vice versa.
Table 6.3: Pros and cons of open-source and closed-source tools (Matteson, 2018)
                Open-source tools             Closed-source tools
Pros            No direct cost                Support by vendor
                High flexibility              Official documentation
                No licensing requirements     Low complexity
                Independence from vendor      Routine updates
Cons            No official support           Cost of service
                Poor documentation            Low flexibility
                High complexity               License schemes
                Slow fixes                    Dependence on vendor
Software and hardware help to effectively deploy ML models. In order to find the best tool for
the task at hand, options must be compared regarding the following factors (Odegua, 2020):
• Efficiency: How efficient is the tool or framework in production? Efficiency refers to
usage of resources like memory, CPU, or time. These factors directly affect the
project performance, reliability, and stability.
• Popularity: How popular is the tool in the developer community? High popularity,
especially of open-source solutions, can indicate that a tool or framework works well
and is actively in use. However, there may be less popular, often proprietary
solutions, that are even more efficient.
• Support: How good is the support for the tool or framework? For open-source solutions, the
availability of resources like tutorials and exemplary use cases provided by the
community defines if good support is given. For proprietary solutions, the support is
evaluated by the service quality of the provider.
Tools and frameworks are applied in all stages of deployment ranging from solutions to
manage the whole ML lifecycle to specialized software for one task. Therefore, in case of
multiple software solutions all components must be compatible with each other. Furthermore,
the experience of the involved team with said solutions is a relevant factor in the decision.
In the scope of this thesis are medium-sized companies in the manufacturing industry. For
this kind of organization, open-source solutions are preferable as they cover a huge variety
of functions that then do not need to be implemented during the deployment.
7 Verification and Validation
In this chapter, the developed methodology is verified and validated in order to assess the
success of the development. Therefore, the methodology is evaluated in the same manner
as existing approaches, implemented for an exemplary use case and discussed in expert
interviews.
According to Balci (1998, p. 336) verification examines the accuracy of transforming a model
from one form into another. Validation, on the other hand, examines if a model behaves with
satisfactory accuracy consistent with the study objectives. In other words, verification is about
building the model right, whereas validation is about building the right model. The approach
by Balci was developed with respect to models and simulation studies. Conceptual models
like the developed methodology, which only have a descriptive structure, cannot be
evaluated with respect to real world behavior and therefore require different methods to
perform verification and validation (Robinson, 2006, p. 796). Rather, conceptual models must
be validated by analyzing if they contain all the necessary details to achieve the goals of the
study (Robinson, 2014, p. 254).
For this purpose, the standards for system, software, and hardware verification and
validation published by the IEEE Computer Society are applied to the developed
methodology. Following the provided definitions, verification describes the process of
evaluating that a system conforms to requirements imposed at the start of the development.
Validation, on the other hand, is defined as the process of providing evidence that the system
satisfies its intended use and user needs.
7.1 Verification
By means of the verification, it is checked if the procedure meets the content-related
requirements (chapter 5.1.1). As these requirements coincide with the criteria which were
used during the analysis of the state of the art in chapter 4, the methodology is evaluated
exactly like existing approaches. Both evaluations are internal processes executed without
the involvement of external parties. Table 7.1 contains the verification results showing the
degree to which the methodology fulfills the previously defined requirements.
Table 7.1: Evaluation of developed methodology
Explanation:
● Completely fulfilled   ◕ Mainly fulfilled   ◑ Partly fulfilled   ◔ Sparsely fulfilled   ○ Not at all fulfilled

Criterion                    Own methodology
Deployment of ML             ●
ML for Predictive Quality    ●
Strategic Planning           ●
Operational Realization      ◑
Guideline                    ●
Transferability              ◕
The developed methodology focuses on the deployment process of ML models into
production, especially for use cases of predictive quality in manufacturing processes.
Strategic aspects are covered in depth providing relevant decisions from a high-level
perspective. However, the operational realization is not treated in equal depth. This is due to
the complexity of the implementation which cannot be covered completely in the scope of
this thesis. Relevant aspects for the implementation are introduced but based on the
provided information more specialized sources need to be consulted. The methodology
serves as a guideline as it illustrates the consecutive steps from end-to-end. It is transferable
to further use cases with the limitation that these use cases fall into the scope of this thesis.
Not all specific predictive quality use cases that might be found in real life, such as image
recognition, are covered.
7.2 Validation
By means of the validation, it is assessed if the described procedure behaves with
satisfactory accuracy in the application. For this purpose, the formal requirements from
chapter 5.1.2 are consulted. The methodology was validated through expert interviews on
the one hand and a practical application on the other.
7.2.1 Expert Interviews
Expert interviews were conducted to check if the methodology represents the defined
observation area, simplifies the overall system to the relevant attributes and elements, is
consistent with reality and is free of contradictions. These criteria cannot be evaluated
without external expertise. Thus, it is necessary to validate the methodology based on the
acceptance of external customers and the suitability for the defined application.
Interviews with Deployment Experts
One-on-one interviews with deployment experts, who have already realized deployments, represent a bottom-up approach for validation. It was analyzed whether existing deployments provided by the experts can be re-built with the methodology. Subsequently, a top-down perspective was taken in the interviews by asking whether the experts could use the methodology for realizing new deployments. Based on the interviews, the methodology was completed by adding missing aspects and resolving discrepancies with their experience.
Workshops with Production Experts
In addition, the methodology was validated in workshops with production experts with the
following structure. First, each phase of the methodology was presented in a separate
workshop. Then, input and feedback by the participants were gathered with respect to
completeness, understandability, and applicability. Based on the participants' comments, the
concept was refined by adding and adjusting content to the expressed needs. As the goal of
the work is to provide a guideline with relevant factors for practice, the input of practitioners is
a valuable source for validation. A description of the applied procedure for validation can be
found in chapter A.1. of the appendix.
7.2.2 Practical Application
Through the practical application, it is evaluated if the methodology can be applied by a
specific user in a target-oriented way, provides useful answers, is easy to apply and interpret, and whether its use is associated with low effort. In the form of a case study, the methodology
is applied to the context of predictive quality. Deploying an ML model in order to predict the
product quality in a production process represents a common use case in the manufacturing
industry, especially for high-tech products with strict quality standards.
A real-life data set from semiconductor manufacturing, made publicly available by the University of California, Irvine at https://archive.ics.uci.edu/ml/datasets/SECOM, was used for the
implementation. Based on an existing performant model, an exemplary deployment was
realized with the help of the developed methodology.
As a first step, the architecture was designed based on the technical requirements described
in the deployment design (chapter 6.1). Predictions are needed in real time and are
consumed through a web app in the browser. The model is embedded into the service to
have only one final application. Given as an input from the model building phase, offline
training is chosen. Finally, the application is hosted locally on-premises. Figure 7.1 illustrates
the architecture setup that was individually defined for the use case.
Figure 7.1: Webservice architecture for use case
Input data is provided in the form of CSV files, which are used by the webservice containing the
model to enable the ML functionality. The service is made accessible in the browser, where
the user has the option of triggering predictions and monitoring model versions. Figure 7.2
shows the home page of the service which is called at 127.0.0.1:5000 in the browser’s
address bar.
Figure 7.2: Screenshot of home page of the webservice
Practically, the deployed application behaves like a regular website, which makes it user-friendly. In the “Prediction” tab, a data file with production data can be
selected and submitted to make a prediction (Figure 7.3). The predictions are then calculated
and presented as indicated in Figure 7.4. Predicted fails are highlighted in red so that the
corresponding worker knows which product requires thorough quality testing. When
accessing the “Monitoring” tab, the active model version is evaluated on a holdout data set
and the respective metrics are presented (Figure 7.5).
(Figure 7.1 components: data input as CSV files; webservice with the embedded ML model providing predictions in real time as a web app, using offline learning and on-premises hosting; user access via browser.)
Figure 7.3: Screenshot of prediction input
Figure 7.4: Screenshot of prediction output
Figure 7.5: Screenshot of monitoring
At this point, an excerpt of the most relevant source code is explained. More source code
including explanations is available in chapter A.2. of the appendix. The implementation of the
application as well as the monitoring and training functionality follows the steps of the
methodology described from chapter 6.2 to 6.4.
In order to access the application through the browser, a local server with Flask is built
through the Python script app.py, which launches the whole application. It defines what is
executed when a certain endpoint (e.g., 127.0.0.1:5000/prediction) is accessed.
First, the necessary imports are made in order to use the required Python libraries. In addition, the predict module of the ML model, located in another folder of the project, is imported.
# imports
import pandas as pd
import os
import joblib
from datetime import datetime
from flask import Flask, request, redirect, url_for, render_template
# import of functionality within the application
import configuration
from ML_model import predict
Then, the app is defined and functions for each endpoint are written.
# definition of the app
app = Flask(__name__)

# standard endpoint
@app.route('/', methods=['GET'])
def home():
    # by accessing the endpoint a GET request is triggered
    if request.method == 'GET':
        # index.html file is returned and displayed
        return render_template("index.html")
When going to the prediction endpoint (GET request), the browser lets the user choose a file.
After the input is sent (POST request), the webservice displays the output in the form of a table
with the prediction results.
# prediction endpoint
@app.route('/prediction', methods=['GET', 'POST'])
def get_prediction():
    if request.method == 'GET':
        # files of production data are listed
        files = os.listdir(configuration.PRODUCTION_DATA_FOLDER)
        # list of files is passed to prediction_input.html
        # html file is returned and displayed
        return render_template("prediction_input.html", list_of_files=files)
    # by pressing the submit button a POST request is made
    if request.method == 'POST':
        # with a POST request the predictions are triggered
        text = "Time and hour of prediction: " + datetime.now().strftime("%d/%m/%Y %H:%M:%S")
        # get the selected option from the dropdown menu
        selected_file = request.form.get("dropdown")
        # check if an option was selected
        if selected_file != '':
            # create empty dataframe
            df = pd.DataFrame()
            # build path of file
            filepath = os.path.join(configuration.PRODUCTION_DATA_FOLDER, selected_file)
            # load input data from selected file
            input_data = pd.read_csv(filepath)
            # fill dataframe with predictions from model
            df = predict.get_prediction_df(input_data)
        # display prediction_output.html to show the predictions as a table
        return render_template("prediction_output.html", pred_to_print=text,
                               table=df.to_html(index=False, header=True, table_id="result_table"))
For the monitoring endpoint, the evaluation metrics are retrieved from the model and
displayed.
# monitoring endpoint
@app.route('/monitoring', methods=['GET'])
def get_evaluation():
    if request.method == 'GET':
        # get version number of model
        version_number = predict.get_version_number()
        # get evaluation metrics scores for model
        scores = predict.get_metrics_scores()
        # display results
        return render_template("monitoring.html", ver=version_number, acc=scores[0],
                               pre=scores[1], rec=scores[2], f1=scores[3])
In order to launch the application with all the aforementioned functions, the main method of
app.py is executed.
# main method
if __name__ == '__main__':
    app.run(debug=False)
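Assuming the service is running locally as described above, its availability could, for example, be checked from a separate Python session. The following snippet is only an illustrative sketch and not part of the implemented application; it requests two of the endpoints defined in app.py and prints the returned HTTP status codes.
# illustrative sketch (not part of the application): check that the running service responds
import requests
for endpoint in ['/', '/monitoring']:
    # request the locally hosted service and print the HTTP status code
    response = requests.get('http://127.0.0.1:5000' + endpoint)
    print(endpoint, response.status_code)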
As stated before, more details on the hands-on realization can be found in the appendix.
8 Conclusion
In this closing chapter, the components of the thesis are reviewed. Furthermore, the
relevance of the work on a social and personal level is highlighted. Finally, an outlook on
future research is given.
Chapter 1 introduced ML as a powerful technology for applications in manufacturing,
especially for predicting quality. Mainly due to a missing standardized procedure, the
deployment of ML models presents a crucial barrier to unfolding the full potential of ML
solutions for businesses. Based on the identification of missing support during the selection
process as the crucial impediment, the goal to develop a methodology for ML model
deployment applied to the context of predictive quality in production was derived.
A more detailed description of the problem in practice was provided in chapter 2. An analysis of the current state of deploying ML models in practice showed evidence of the
unsatisfactory percentage of successful deployments and the associated waste of resources.
Thereupon, the main challenges leading to the failure of many ML projects were identified so
that they can be addressed in the methodology. Due to the topic’s importance and
highlighted room for improvement, the need for further investigation was derived.
Deploying ML models for predictive quality in production requires the combination of
knowledge about quality management, ML, and software engineering so that a basic
understanding of the three areas of investigation is essential. Thus, chapter 3 introduced
relevant theoretical concepts that are necessary to comprehend existing approaches but also
used in the development of the methodology.
In chapter 4, an analysis of the state of the art was conducted. Criteria were defined in
accordance with the set objective to evaluate existing approaches. Both academic and gray
literature was reviewed in order to fully capture the topic. Through the consultation of multiple
sources, the search results were selected and analyzed. As a result, a research gap could be
identified as existing approaches do not serve to fulfill the set objective to a satisfactory
degree, making further research necessary.
Before the elaboration of the methodology itself, it is outlined in chapter 5 by defining the
requirements, narrowing down the scope and establishing the relation to a reference
framework. Precise requirements aim to ensure that the overall objective is fulfilled and need
to be considered before the development of the methodology. Likewise, the research area
must be clearly bounded beforehand. Thereby, the methodology’s structure needs to comply
with a framework which is given as a reference from previous research activities in the field.
The subsequent chapter 6 comprises the development of the methodology covering the
phases deployment design, productionizing & testing, monitoring, and retraining. As a
summary of the developed methodology, Figure 8.1 shows a final overview of the relevant
steps and decisions in each of the four phases. Moreover, roles and competencies as well as tools and frameworks were described as overarching aspects in order to capture the deployment in its entirety.
After the elaboration of the methodology, it was verified and validated through expert
interviews and practical implementation in chapter 7. By means of the evaluation of the
methodology with defined criteria, it was shown that it fulfills the set goal to a satisfactory
degree. The developed conceptual procedure was validated by a hands-on implementation
which can be used as starting point for deploying ML models in organizations.
By means of the validation, it was demonstrated that this thesis contributes to improving the
deployment process of ML models. In predictive quality applications, a successful deployment
leads to an increase in efficiency in production and ultimately to the reduction of cost. In the
context of social responsibility, the work has ethical implications by ensuring the profitability
of production which facilitates the preservation of jobs in the manufacturing industry.
On a personal level, the thesis helped to further develop transversal competencies. An
awareness of contemporary issues was achieved by identifying and interpreting the use of
ML models in the field of industrial engineering and predictive quality in particular. Moreover,
the competence of handling specific instruments relevant for the field of investigation was
enhanced. Data science and ML deployment technologies ranging from the programming
language Python to specific libraries and tools were selected and applied.
To conclude this thesis, the developed methodology can be used as a starting point for
further research. Due to the dynamic nature and extent of ML as an area of investigation, no
concept can claim to be complete and valid for all situations. Therefore, future lines of
research can explore each phase of the methodology in more depth, especially the
integration of DevOps techniques into the ML life cycle. Ultimately, all these techniques aim
to automate the whole end-to-end ML pipeline from data integration to deployment. With
respect to possible applications, it can be investigated in the future how the methodology can
be applied to further use cases within predictive quality but also to companies outside of
the manufacturing industry.
V Bibliography
Ackerman, S., Farchi, E., Raz, O., Zalmanovici, M., & Dube, P. (2020). Detection of data drift
and outliers affecting machine learning model performance over time.
http://arxiv.org/pdf/2012.09258v2
Ackermann, K., Walsh, J., Unánue, A. de, Naveed, H., Navarrete Rivera, A., Lee, S.‑J.,
Bennett, J., Defoe, M., Cody, C., Haynes, L., & Ghani, R. (2018). Deploying machine
learning models for public policy. In Y. Guo & F. Farooq (Eds.), Proceedings of the 24th
acm sigkdd international conference on knowledge discovery & data mining (pp. 15–22).
ACM. https://doi.org/10.1145/3219819.3219911
Agrawal, S., & Mittal, A. (2020). MLOps: 5 Steps to Operationalize Machine Learning
Models: Automate and Productize Machine Learning Algorithms. Informatica.
https://ai4.io/wp-content/uploads/2020/08/2020-08-
07_5f2d921aa925b_MLOps.resources.asset_.faf63486bc68f826d48f086366e9a96d.pdf
Akyildiz, B. (2020a). How to monitor models. https://bugra.github.io/posts/2020/11/24/how-to-
monitor-models/
Akyildiz, B. (2020b). How to serve models. https://bugra.github.io/posts/2020/5/25/how-to-
serve-model/
Algorithmia. (2019). 2020 state of enterprise machine learning.
https://info.algorithmia.com/hubfs/2019/Whitepapers/The-State-of-Enterprise-ML-
2020/Algorithmia_2020_State_of_Enterprise_ML.pdf?utm_campaign=The%20Batch&utm
_source=hs_email&utm_medium=email&_hsenc=p2ANqtz-
9SrICt7U8VAGt4GwFxt47WmEhatriglgLs_5xcaO6b0zG4wsu7No-l5jLL-ypPEck0QMdT
Alpaydin, E. (2014). Introduction to Machine Learning (3rd ed.). Adaptive Computation and
Machine Learning series / Ethem Alpaydin. MIT Press.
Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., &
Zimmermann, T. (2019). Software engineering for machine learning: A case study. In
2019 ieee/acm 41st international conference on software engineering: Software
engineering in practice (icse-seip) (pp. 291–300). IEEE. https://doi.org/10.1109/ICSE-
SEIP.2019.00042
Ariharan, V., Eswaran, S. P., Vempati, S., & Anjum, N. (2019). Machine learning quorum
decider (mlqd) for large scale iot deployments. Procedia Computer Science, 151, 959–
964. https://doi.org/10.1016/j.procs.2019.04.134
Azevedo, A. (2008). Kdd, semma and crisp-dm: a parallel overview. In Iadis European
conference data mining (pp. 182–185).
Baier, L., Jöhren, F., & Seebacher, S. (2019, June 8). Challenges in the deployment and
operation of machine learning in practice. In Proceedings of the 27th European
conference on information systems (ecis), Stockholm & Uppsala, Sweden.
Balci, O. (1998). Verification, validation, and testing. In J. Banks (Ed.), Handbook of
simulation (pp. 335–393). John Wiley & Sons, Inc.
Benington, H. D. (1983). Production of large computer programs. IEEE Annals of the History
of Computing, 5(4), 350–361. https://doi.org/10.1109/MAHC.1983.10102
Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R.,
Moura, J. M. F., & Eckersley, P. (2020). Explainable machine learning in deployment. In
M. Hildebrandt, C. Castillo, E. Celis, S. Ruggieri, L. Taylor, & G. Zanfir-Fortuna (Eds.),
Proceedings of the 2020 conference on fairness, accountability, and transparency
(pp. 648–657). ACM. https://doi.org/10.1145/3351095.3375624
Bignu, A. (2019). Web apps vs native apps: What is the best choice for a data scientist?
https://medium.datadriveninvestor.com/web-apps-vs-native-apps-what-is-the-best-choice-
for-a-data-scientist-3d31169d2335
bigwater.consulting. (2019). Software development life cycle (sdlc). BIG WATER
CONSULTING (BWC). https://bigwater.consulting/2019/04/08/software-development-life-
cycle-sdlc/
Björklund, T. (2017). Apis for non-techies (like myself). https://medium.com/apinf/apis-for-
non-techies-like-myself-259f60042ba
Boehm, B. W. (1988). A spiral model of software development and enhancement. Computer,
21(5), 61–72. https://doi.org/10.1109/2.59
Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). The ml test score: A rubric for
ml production readiness and technical debt reduction. In J.-Y. Nie, Z. Obradovic, T.
Suzumura, R. Ghosh, R. Nambiar, & C. Wang (Eds.), 2017 ieee international conference
on big data: dec 11-14, 2017, boston, ma, USA : Proceedings. IEEE.
https://storage.googleapis.com/pub-tools-public-publication-
data/pdf/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf
Brosset, P., Patsko, S., & Khadikar, A. (2019). Scaling ai in manufacturing operations: a
practicioner's perspective. Capgemini Research Institute.
Brüning, J., Denkena, B., Dittrich, M.‑A., & Hocke, T. (2017). Machine learning approach for
optimization of automated fiber placement processes. Procedia CIRP, 66, 74–78.
https://doi.org/10.1016/j.procir.2017.03.295
Cambridge Dictionary. (2014). Methodology.
https://dictionary.cambridge.org/de/worterbuch/englisch/methodology
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C. R., & Wirth, R.
(2000). CRISP-DM 1.0: Step-by-step data mining guide. Copenhagen. SPSS.
Chen, J. (2020). Azure fundamental: Iaas, paas, saas. https://medium.com/chenjd-xyz/azure-
fundamental-iaas-paas-saas-973e0c406de7
Chigira, M. (2019). Continuous deployment tools. https://scoutapm.com/blog/continuous-
deployment-tools
Choo, C. (2018). The cloud models: Iaas vs paas vs saas.
https://www.linkedin.com/pulse/cloud-models-iaas-vs-paas-saas-clara-choo
Crankshaw, D., & Gonzalez, J. (2018). Prediction-serving systems. Queue, 16(1), 83–97.
https://doi.org/10.1145/3194653.3210557
Cumbie, B. A., Jourdan, Z., Peachy, T., Dugo, T. M., & Craighead, C. W. (2005). Enterprise
resource planning research: Where are we now and where should we go from here?
Journal of Information Technology Theory and Application (JITTA), Vol. 7(Iss. 2), 21–36.
https://aisel.aisnet.org/jitta/vol7/iss2/4
Debauche, O., Mahmoudi, S., Mahmoudi, S. A., Manneback, P., & Lebeau, F. (2020). A new
edge architecture for ai-iot services deployment. Procedia Computer Science, 175, 10–19.
https://doi.org/10.1016/j.procs.2020.07.006
Decosmo, J. (2019). What nobody tells you about machine learning.
https://www.forbes.com/sites/forbestechcouncil/2019/04/23/what-nobody-tells-you-about-
machine-learning/#50b479f55ac1
Dowling, J. (2019). Guide to file formats for machine learning: Columnar, training,
inferencing, and the feature store. https://towardsdatascience.com/guide-to-file-formats-
for-machine-learning-columnar-training-inferencing-and-the-feature-store-2e0c3d18d4f9
Druzkowski, M. (2017). Building ml models is hard. Deploying them in real business
environments is harder. https://medium.com/ocadotechnology/building-ml-models-is-hard-
deploying-them-in-real-business-environments-is-harder-c2a0433f527
Eising, P. (2017). What exactly is an api? https://medium.com/@perrysetgo/what-exactly-is-
an-api-
69f36968a41f#:~:text=Application%20Programming%20Interface%20(API),to%20commu
nicate%20with%20one%20another.&text=JSON%20or%20XML”.-
,The%20API%20is%20not%20the%20database%20or%20even%20the%20server,that%2
0can%20access%20a%20database.
en.proft.me. (2015). Types of machine learning algorithms.
https://en.proft.me/2015/12/24/types-machine-learning-algorithms/
Escobar, C. A., Morales-Menendez, R., & Macias, D. (2020). Process-monitoring-for-quality
— a machine learning-based modeling for rare event detection. Array, 7, 100034.
https://doi.org/10.1016/j.array.2020.100034
Everett, G. D., & McLeod, R. (2007). Software testing: Testing across the entire software
development life cycle. Wiley-Interscience.
http://www.loc.gov/catdir/enhancements/fy0739/2007001282-b.html
Fagerberg, D. (2015). Container vs package deployments.
https://lastbytes.wordpress.com/2015/09/15/container-vs-package-deployments/
Figalist, I., Elsner, C., Bosch, J., & Olsson, H. H. (2020). An end-to-end framework for
productive use of machine learning in software analytics and business intelligence
solutions. In M. Morisio, M. Torchiano, & A. Jedlitschka (Eds.), Lecture Notes in Computer
Science. Product-Focused Software Process Improvement (Vol. 12562, pp. 217–233).
Springer International Publishing. https://doi.org/10.1007/978-3-030-64148-1_14
Flach, P. (2012). Machine Learning: The Art and Science of Algorithms That Make Sense of
Data. Cambridge University Press. https://doi.org/10.1017/CBO9780511973000
Forsberg, K., & Mooz, H. (1998). System engineering for faster, cheaper, better. Center for
Systems Management, Inc.
https://web.archive.org/web/20030420130303/http://www.incose.org/sfbac/welcome/fcb-
csm.pdf
Frye, M., & Schmitt, R. H. (2019). Quality improvement of milling processes using machine
learning-algorithms. In 16th imeko tc10 conference on testing, diagnostics and inspection
2019: testing, diagnostics and inspection as a comprehensive value chain for quality and
safety, Berlin, Germany.
Galli, S. (2020). How to build and deploy a reproducible machine learning pipeline.
https://trainindata.medium.com/how-to-build-and-deploy-a-reproducible-machine-learning-
pipeline-20119c0ab941
Galli, S., & Samiullah, C. (2021). Deployment of machine learning models. Udemy.
https://www.udemy.com/course/deployment-of-machine-learning-models/
Garousi, V., Felderer, M., & Mäntylä, M. V. (2019). Guidelines for including grey literature
and conducting multivocal literature reviews in software engineering. Information and
Software Technology, 106, 101–121. https://doi.org/10.1016/j.infsof.2018.09.006
Géron, A. (2018). Praxiseinstieg Machine Learning mit Scikit-Learn und TensorFlow:
Konzepte, Tools und Techniken für intelligente Systeme ((K. Rother, Trans.)) (1. Auflage).
O'Reilly. https://www.oreilly.de/buecher/13111/9783960090618-praxiseinstieg-machine-
learning-mit-scikit-learn-und-tensorflow.html
Gherman, A. (2020). Data engineering and data science collaboration processes.
https://towardsdatascience.com/data-engineer-and-data-science-collaboration-processes-
b2d7abcfc74f
Gisselaire, L., Cario, F., Guerre-berthelot, Q., Zigmann, B., Du Bousquet, L., & Nakamura, M.
(2019). Toward evaluation of deployment architecture of ml-based cyber-physical
systems. In 2019 34th ieee/acm international conference on automated software
engineering workshop (asew) (pp. 90–93). IEEE.
https://doi.org/10.1109/ASEW.2019.00036
GitLab. (2021). Introduction to gitlab flow. https://docs.gitlab.com/ee/topics/gitlab_flow.html
Goldman, C. V., Baltaxe, M., Chakraborty, D., & Arinez, J. (2021). Explaining learning
models in manufacturing processes. Procedia Computer Science, 180, 259–268.
https://doi.org/10.1016/j.procs.2021.01.163
Gonfalonieri, A. (2019). Why is machine learning deployment hard?
https://towardsdatascience.com/why-is-machine-learning-deployment-hard-443af67493cd
Halstenberg, J., Pfitzinger, B., & Jestädt, T. (2020). DevOps. Springer Fachmedien
Wiesbaden. https://doi.org/10.1007/978-3-658-31405-7
Harlann, I. (2017). Devops is a culture, not a role! https://neonrocket.medium.com/devops-is-
a-culture-not-a-role-be1bed149b0
Harrington, P. (2012). Machine learning in action. Manning Publications Co.
Hornick, M. (2018). A data science maturity model for enterprise assessment. Oracle.
https://cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-
e4aebca50425/2178fa83-87f2-4bdc-a2ff-
384a5382d3bd/File/146aef5f88d7e7f646fb9280c7b5e25f/a_data_science_maturity_model
_for_enterprise_assessment_wp.pdf
Houghton, J. (2018). Understanding what apis are all about. https://medium.com/vody-
techblog/understanding-what-apis-are-all-about-ff2513b76a55
Hunt, X. (2017). Online learning: Machine lerning's secret for big data.
https://blogs.sas.com/content/subconsciousmusings/2017/10/17/online-learning-machine-
learnings-secret-big-data/
IEEE Computer Society. IEEE Standard for System, Software, and Hardware Verification
and Validation. Piscataway, NJ, USA. IEEE.
IEEE Computer Society. IEEE Standard Glossary of Software Engineering Terminology.
Piscataway, NJ, USA. IEEE.
Jeffcock, P. (2018). What's the difference between ai, machine learning, and deep learning?
Oracle. https://blogs.oracle.com/bigdata/difference-ai-machine-learning-deep-learning
John, M. M., Holmström Olsson, H., & Bosch, J. (2021). Architecting ai deployment: A
systematic review of state-of-the-art and state-of-practice literature. In E. Klotins & K.
Wnuk (Eds.), Lecture Notes in Business Information Processing. Software Business (Vol.
407, pp. 14–29). Springer International Publishing. https://doi.org/10.1007/978-3-030-
67292-8_2
Johnson, K. (2019). Ai predictions for 2019. VentureBeat.
https://venturebeat.com/2019/01/02/ai-predictions-for-2019-from-yann-lecun-hilary-
mason-andrew-ng-and-rumman-chowdhury/
Kavikondala, A., Muppalla, V., Prakasha K., K., & Acharya, V. (2019). Automated retraining
of machine learning models. International Journal of Innovative Technology and Exploring
Engineering, 8(12), Article L33221081219, 445–452.
https://doi.org/10.35940/ijitee.L3322.1081219
Keeney, R. L. (1992). Value-focused thinking: a path to creative decision making. Cambridge
Mass.: Harvard University Press.
Keeney, R. L., & Gregory, R. S. (2005). Selecting attributes to measure the achievement of
objectives. Operations Research, 53(1), 1–11. https://doi.org/10.1287/opre.1040.0158
Kervizic, J. (2019). Overview of the different approaches to putting machine learning (ml)
models in production. https://medium.com/analytics-and-data/overview-of-the-different-
approaches-to-putting-machinelearning-ml-models-in-production-c699b34abf86
Kimera, D., & Nangolo, F. N. (2020). Predictive maintenance for ballast pumps on ship repair
yards via machine learning. Transportation Engineering, 2, 100020.
https://doi.org/10.1016/j.treng.2020.100020
Kominos, C. G., Seyvet, N., & Vandikas, K. (2017). Bare-metal, virtual machines and
containers in openstack. In 2017 20th conference on innovations in clouds, internet and
networks (icin) (pp. 36–43). IEEE. https://doi.org/10.1109/ICIN.2017.7899247
Konstantinidis, F. (2020). Why and how to run machine learning algorithms on edge devices.
https://www.therobotreport.com/why-and-how-to-run-machine-learning-algorithms-on-
edge-devices/
Kotu, V., & Deshpande, B. (2019). Data Science (Second Edition). Morgan Kaufmann
Publishers.
Krauß, J. Automl benchmark in production. https://jonathankrauss.github.io/AutoML-
Benchmark/
Krauß, J., Pacheco, B. M., Zang, H. M., & Schmitt, R. H. (2020). Automated machine
learning for predictive quality in production. Procedia CIRP, 93, 443–448.
https://doi.org/10.1016/j.procir.2020.04.039
Larsen, J. (2019). Why do 87% of data science projects never make it into production?
VentureBeat. https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-
never-make-it-into-production/
Lawton, G. (2020). 7 last-mile delivery problems in ai and how to solve them.
https://searchenterpriseai.techtarget.com/feature/7-last-mile-delivery-problems-in-AI-and-
how-to-solve-them
Lehmann, C., Goren Huber, L., Horisberger, T., Scheiba, G., Sima, A. C., & Stockinger, K.
(2020). Big data architecture for intelligent maintenance: A focus on query processing and
machine learning algorithms. Journal of Big Data, 7(1). https://doi.org/10.1186/s40537-
020-00340-7
Lichtenwalter, D., Burggräf, P., Wagner, J., & Weißer, T. (2021). Deep multimodal learning
for manufacturing problem solving. Procedia CIRP, 99, 615–620.
https://doi.org/10.1016/j.procir.2021.03.083
Liu, Y., Ling, Z., Huo, B., Wang, B., Chen, T., & Mouine, E. (2020). Building a platform for
machine learning operations from open source frameworks. IFAC-PapersOnLine, 53(5),
704–709. https://doi.org/10.1016/j.ifacol.2021.04.161
Lwakatare, L. E., Raj, A., Bosch, J., Olsson, H. H., & Crnkovic, I. (2019). A taxonomy of
software engineering challenges for machine learning systems: An empirical investigation.
In P. Kruchten, S. Fraser, & F. Coallier (Eds.), Lecture Notes in Business Information
Processing. Agile Processes in Software Engineering and Extreme Programming (Vol.
355, pp. 227–243). Springer International Publishing. https://doi.org/10.1007/978-3-030-
19034-7_14
Matteson, S. (2018). How to decide if open source or proprietary software solutions are best
for your business. https://www.techrepublic.com/article/how-to-decide-if-open-source-or-
proprietary-software-solutions-are-best-for-your-business/
Mehta, P., Butkewitsch-Choze, S., & Seaman, C. (2018). Smart manufacturing analytics
application for semi-continuous manufacturing process – a use case. Procedia
Manufacturing, 26, 1041–1052. https://doi.org/10.1016/j.promfg.2018.07.138
Mitchell, T. M. (2010). Machine learning (International ed. [Reprint.]. McGraw-Hill series in
computer science. McGraw-Hill.
Mobley, R. K. (2002). An introduction to predictive maintenance (2. ed.). Butterworth-
Heinemann. http://www.loc.gov/catdir/description/els031/2001056670.html
Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of machine learning
(Second edition). Adaptive computation and machine learning. MIT Press.
Murray, P. E. (2006). Traditional development/integration/staging/production practice for
software development. https://dltj.org/article/software-development-practice/
Muthusamy, V., Slominski, A., & Ishakian, V. (2018). Towards enterprise-ready ai
deployments minimizing the risk of consuming ai models in business applications. In 2018
first international conference on artificial intelligence for industries (ai4i) (pp. 108–109).
IEEE. https://doi.org/10.1109/AI4I.2018.8665685
Nalbach, O., Linn, C., Derouet, M., & Werth, D. (2018). Predictive quality: Towards a new
understanding of quality assurance using machine learning tools. In W. Abramowicz & A.
Paschke (Eds.), Lecture Notes in Business Information Processing. Business Information
Systems (Vol. 320, pp. 30–42). Springer International Publishing.
https://doi.org/10.1007/978-3-319-93931-5_3
National Research Council. (1995). Unit Manufacturing Processes. National Academies
Press. https://doi.org/10.17226/4827
Neal Analytics. (2020). Machine learning operations (mlops). https://nealanalytics.com/wp-
content/uploads/2020/07/MLOps-Datasheet.pdf
Newman, A. (2016). How to use dashboards and alerts for data monitoring.
https://www.loggly.com/blog/how-to-use-dashboards-and-alerts-for-data-monitoring/
Ngo, Q. H., & Schmitt, R. H. (2016). A data-based approach for quality regulation. Procedia
CIRP, 57, 498–503. https://doi.org/10.1016/j.procir.2016.11.086
Nishida, K., & Yamauchi, K. (2007). Detecting concept drift using statistical testing. In V.
Corruble, M. Takeda, & E. Suzuki (Eds.), Lecture Notes in Computer Science. Discovery
Science (Vol. 4755, pp. 264–269). Springer Berlin Heidelberg. https://doi.org/10.1007/978-
3-540-75488-6_27
Odegua, R. (2020). How to put machine learning models into production.
https://stackoverflow.blog/2020/10/12/how-to-put-machine-learning-models-into-
production/
Oxford University Press. (2020). Definition of deployment.
https://www.lexico.com/definition/deployment
Pääkkönen, P., & Pakkala, D. (2020). Extending reference architecture of big data systems
towards machine learning in edge computing environments. Journal of Big Data, 7(1).
https://doi.org/10.1186/s40537-020-00303-y
Patruno, L. (2019). The ultimate guide to model retraining. https://mlinproduction.com/model-
retraining/
Patruno, L. (2020). The ultimate guide to deploying machine learning models. ML in
Production. https://mlinproduction.com/deploying-machine-learning-models/;
https://mlinproduction.com/what-does-it-mean-to-deploy-a-machine-learning-model-
deployment-series-01/; https://mlinproduction.com/software-interfaces-for-machine-
learning-deployment-deployment-series-02/; https://mlinproduction.com/batch-inference-
for-machine-learning-deployment-deployment-series-03/; https://mlinproduction.com/the-
challenges-of-online-inference-deployment-series-04/; https://mlinproduction.com/online-
inference-for-ml-deployment-deployment-series-05/; https://mlinproduction.com/model-
registries-for-ml-deployment-deployment-series-06/; https://mlinproduction.com/testing-
machine-learning-models-deployment-series-07/; https://mlinproduction.com/ab-test-ml-
models-deployment-series-08/
Patzak, G. (1982). Systemtechnik - Planung komplexer innovativer Systeme: Grundlagen,
Methoden, Techniken. Springer.
Pennington, J. (2019). The eight phases of a devops pipeline.
https://medium.com/taptuit/the-eight-phases-of-a-devops-pipeline-fda53ec9bba
Perrault, R., Shoham, Y., Brynjolfsson, E., Clark, J., Etchemendy, J., Grosz, B., Lyons, T., &
Manyika, J. (2019). The ai index 2019 annual report. AI Index Steering Committee,
Human-Centered AI Institute.
Pilarski, S., Staniszewski, M., Bryan, M., Villeneuve, F., & Varró, D. (2021). Predictions-on-
chip: Model-based training and automated deployment of machine learning models at
runtime. Software and Systems Modeling. Advance online publication.
https://doi.org/10.1007/s10270-020-00856-9
Pinhasi, A. (2020). Deploying machine learning models to production — inference service
architecture patterns. https://medium.com/data-for-ai/deploying-machine-learning-models-
to-production-inference-service-architecture-patterns-bc8051f70080
Posta, C. (2015). Blue-green deployments, a/b testing, and canary releases.
https://blog.christianposta.com/deploy/blue-green-deployments-a-b-testing-and-canary-
releases/
Quintanilla, L., Schonning, N., Kershaw, N., Victor, Y., Wenzel, M., Pratschner, S.,
Potapenko, M., Gronlund, C. J., Alexander, J., Kulikov, P., & Dugar, A. (2019). Machine
learning tasks in ml.Net. Microsoft. https://docs.microsoft.com/en-us/dotnet/machine-
learning/resources/tasks
Rao, A., Likens, S., & Shehab, M. (2019). 2019 ai predictions: six ai priorities you can’t afford
to ignore. PwC US. https://www.pwc.com/us/en/services/consulting/library/artificial-
intelligence-predictions-2019.html
Robinson, S. (2006). Conceptual modeling for simulation: Issues and research requirements.
In Proceedings of the 2006 winter simulation conference (pp. 792–800). IEEE.
https://doi.org/10.1109/WSC.2006.323160
Robinson, S. (2014). Simulation: The practice of model development and use (2nd edition).
Palgrave Macmillan.
Rodríguez, M. Á., Alemany, M. M. E., Boza, A., Cuenca, L., & Ortiz, Á. (2020). Artificial
intelligence in supply chain operations planning: Collaboration and digital perspectives. In
L. M. Camarinha-Matos, H. Afsarmanesh, & A. Ortiz (Eds.), IFIP Advances in Information
and Communication Technology. Boosting Collaborative Networks 4.0 (Vol. 598, pp. 365–
378). Springer International Publishing. https://doi.org/10.1007/978-3-030-62412-5_30
Royce, W. W. (1970). Managing the development of large software systems: Concepts and
techniques. Proc. IEEE WESTCON, Los Angeles, 1–9.
Rychener, L., Montet, F., & Hennebert, J. (2020). Architecture proposal for machine learning
based industrial process monitoring. Procedia Computer Science, 170, 648–655.
https://doi.org/10.1016/j.procs.2020.03.137
Saha, P., & Bose, A. (2021). Mlops: Model monitoring 101.
https://www.kdnuggets.com/2021/01/mlops-model-monitoring-101.html
Salminen, J., Milenković, M., & Jansen, B. J. (2017). Problems of data science in
organizations: An explorative qualitative analysis of business professionals’ concerns. In
E. Y. Li & K. N. Shen (Chairs), The 17th International Conference on Electronic Business,
Dubai.
Saltz, J. (2020). Crisp-dm is still the most popular framework for executing data science
projects. https://www.datascience-pm.com/crisp-dm-still-most-popular/
Samiullah, C. (2019). How to deploy machine learning models.
https://christophergs.com/machine%20learning/2019/03/17/how-to-deploy-machine-
learning-models/
Samiullah, C. (2020). Monitoring machine learning models in production.
https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-
learning-models/
Samuylova, E. (2020). Machine learning in production: Why you should care about data and
concept drift. https://towardsdatascience.com/machine-learning-in-production-why-you-
should-care-about-data-and-concept-drift-d96d0bc907fb
Sarkar, D., Bali, R., & Sharma, T. (2018). Practical machine learning with Python: A problem-
solver's guide to building real-world intelligent systems. Apress.
Sato, D., Wider, A., & Windheuser, C. (2019). Continuous delivery for machine learning:
automating the end-to-end lifecycle of machine learning applications.
https://martinfowler.com/articles/cd4ml.html
Schmitt, J., Bönig, J., Borggräfe, T., Beitinger, G., & Deuse, J. (2020). Predictive model-
based quality inspection using machine learning and edge cloud computing. Advanced
Engineering Informatics, 45, 101101. https://doi.org/10.1016/j.aei.2020.101101
Schmitt, R. H., Kurzhals, R., Ellerich, M., Nilgen, G., Schlegel, P., Dietrich, E., & Krauß, J.
(2020). Predictive quality – data analytics in produzierenden unternehmen. In Internet of
production - turning data into value (pp. 226–253).
Schorr, S., Möller, M., Heib, J., Fang, S., & Bähre, D. (2020). Quality prediction of reamed
bores based on process data and machine learning algorithm: A contribution to a more
sustainable manufacturing. Procedia Manufacturing, 43, 519–526.
https://doi.org/10.1016/j.promfg.2020.02.180
Schwaber, K., & Sutherland, J. (2020). The scrum guide: the definitive guide to scrum: The
rules of the game. https://scrumguides.org/docs/scrumguide/v2020/2020-Scrum-Guide-
US.pdf#zoom=100
Scrum.org. (2020). What is scrum? https://www.scrum.org/resources/what-is-scrum
Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V.,
Young, M., & Dennison, D. (2015). Hidden technical debt in machine learning systems. In
C. Cortes, D. D. Lee, M. Sugiyama, & R. Gernett (Eds.), Proceedings of the 28th
international conference on neural information processing systems (2nd ed., pp. 2503–
2511). MIT Press, Cambridge, MA, USA.
https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
Serban, A., van der Blom, K., Hoos, H., & Visser, J. (2020). Adoption and effects of software
engineering best practices in machine learning. In Proceedings of the 14th acm / ieee
international symposium on empirical software engineering and measurement (esem)
(pp. 1–12). ACM. https://doi.org/10.1145/3382494.3410681
Shaik, N. (2019). Unpacking the complexity of machine learning deployments.
https://predera.com/unpacking-the-complexity-of-machine-learning-deployments/
Shalev-Shwartz, S., & Ben-David, S. (2019). Understanding machine learning: From theory
to algorithms (12th printing). Cambridge University Press.
Shrivastava, T. (2016). 8 reasons why analytics / machine learning models fail to get
deployed. https://www.analyticsvidhya.com/blog/2016/05/8-reasons-analytics-machine-
learning-models-fail-deployed/
Simek, P., & Slomkova, K. (2021). Automated deployment.
https://developerexperience.io/practices/automated-deployment
Singh, P. (2021). Deploy Machine Learning Models to Production. Apress.
https://doi.org/10.1007/978-1-4842-6546-8
Singh Bisen, V. (2019). These are the reasons why more than 95% ai and ml projects fail.
https://medium.com/vsinghbisen/these-are-the-reasons-why-more-than-95-ai-and-ml-
projects-fail-cd97f4484ecc
Sridharan, C. (2018). Distributed Systems Observability: A Guide to Building Robust
Systems. O’Reilly Media. https://unlimited.humio.com/rs/756-LMY-106/images/Distributed-
Systems-Observability-eBook.pdf
Stachowiak, H. (1973). Allgemeine Modelltheorie. Springer.
Stewart, M. (2019). Understanding dataset shift.
https://towardsdatascience.com/understanding-dataset-shift-f2a5a262a766
Subasi, A. (2020). Practical machine learning for data analysis using Python. Elsevier;
Academic Press.
Svetashova, Y., Zhou, B., Pychynski, T., Schmidt, S., Sure-Vetter, Y., Mikut, R., &
Kharlamov, E. (2020). Ontology-enhanced machine learning: A bosch use case of welding
quality monitoring. In J. Z. Pan, V. Tamma, C. d’Amato, K. Janowicz, B. Fu, A. Polleres,
O. Seneviratne, & L. Kagal (Eds.), Lecture Notes in Computer Science. The Semantic
Web – ISWC 2020 (Vol. 12507, pp. 531–550). Springer International Publishing.
https://doi.org/10.1007/978-3-030-62466-8_33
Talby, D. (2019). Why machine learning models crash and burn in production.
https://www.forbes.com/sites/forbestechcouncil/2019/04/03/why-machine-learning-
models-crash-and-burn-in-production/#64b9b84c2f43
Thomas, J., & Mewald, C. (2019). Productionizing machine learning: From deployment to
drift detection. https://slacker.ro/2019/09/18/productionizing-machine-learning-from-
deployment-to-drift-detection/
Tremel, E. (2017). Six strategies for application deployment.
https://thenewstack.io/deployment-strategies/
Treveil, M., & Dataiku Team. (2020). Introducing MLOps. O'Reilly Media, Inc.
Turck, M. (2020). Resilience and vibrancy: The 2020 data & ai landscape.
https://mattturck.com/data2020/
Turetskyy, A., Wessel, J., Herrmann, C., & Thiede, S. (2021). Battery production design
using multi-output machine learning models. Energy Storage Materials, 38, 93–112.
https://doi.org/10.1016/j.ensm.2021.03.002
Ulrich, H., Dyllick, T., & Probst, G. (1984). Management. Haupt.
Vafeiadis, T., Ioannidis, D., Ziazios, C., Metaxa, I. N., & Tzovaras, D. (2017). Towards robust
early stage data knowledge-based inference engine to support zero-defect strategies in
manufacturing environment. Procedia Manufacturing, 11, 679–685.
https://doi.org/10.1016/j.promfg.2017.07.167
Washizaki, H., Uchida, H., Khomh, F., & Gueheneuc, Y.‑G. (2019). Studying software
engineering patterns for designing machine learning systems. In 2019 10th international
workshop on empirical software engineering in practice (iwesep) (pp. 49–495). IEEE.
https://doi.org/10.1109/IWESEP49350.2019.00017
Waterworth, S. (2019). Observability vs. Monitoring.
https://www.instana.com/blog/observability-vs-monitoring/
Watts, S., & Raza, M. (2019). Saas vs paas vs iaas: What’s the difference & how to choose.
https://www.bmc.com/blogs/saas-vs-paas-vs-iaas-whats-the-difference-and-how-to-
choose/
Wehrstein, L. (2020). Crisp-dm ready for machine learning projects.
https://towardsdatascience.com/crisp-dm-ready-for-machine-learning-projects-
2aad9172056a
Wheeler, S. (2019). What does it mean to “productionize” data science?
https://towardsdatascience.com/what-does-it-mean-to-productionize-data-science-
82e2e78f044c
Wohlin, C. (2014). Guidelines for snowballing in systematic literature studies and a
replication in software engineering. In M. Shepperd, T. Hall, & I. Myrtveit (Eds.),
Proceedings of the 18th international conference on evaluation and assessment in
software engineering - ease '14 (pp. 1–10). ACM Press.
https://doi.org/10.1145/2601248.2601268
Yong, B. X., & Brintrup, A. (2020). Multi agent system for machine learning under uncertainty
in cyber physical manufacturing system. In T. Borangiu, D. Trentesaux, P. Leitão, A. Giret
Boggino, & V. Botti (Eds.), Studies in Computational Intelligence. Service Oriented,
Holonic and Multi-agent Manufacturing Systems for Industry of the Future (Vol. 853,
pp. 244–257). Springer International Publishing. https://doi.org/10.1007/978-3-030-27477-
1_19
Zahrani, E. G., Hojati, F., Daneshi, A., Azarhoushang, B., & Wilde, J. (2020). Application of
machine learning to predict the product quality and geometry in circular laser grooving
process. Procedia CIRP, 94, 474–480. https://doi.org/10.1016/j.procir.2020.09.167
Zeiser, A., van Stein, B., & Bäck, T. (2021). Requirements towards optimizing analytics in
industrial processes. Procedia Computer Science, 184, 597–605.
https://doi.org/10.1016/j.procs.2021.03.074
Zinkevich, M. (2016). Rules of Machine Learning: Best Practices for ML Engineering.
http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
Zwicky, F., & Wilson, A. G. (Eds.) (1967). New methods of thought and procedure. Springer-
Verlag.
Icons made by Freepik from www.flaticon.com.
VI Budgeting
As an addition to the written document, this section covers considerations regarding the
budget when developing and executing a concrete project to deploy an ML model for
predictive quality into production.
Cost of Development of Project
With respect to developing the project, all necessary activities for creating this document
were executed by the author himself. Figure 1 shows the time plan for the whole thesis as
well as the proportions of the activities. An analysis of the current state and problem in
practice comprises 10 % of the total effort. Introducing theoretical fundamentals and evaluating existing approaches account for 30 % of the effective working time, while outlining and developing the methodology takes up 40 %. The remaining 20 % are used to validate the methodology and implement a use case.
Figure 1: Time plan of thesis
The only cost associated with the development of the project is the labor of the author.
Applying an hourly wage of 14 €/h valid for students with a completed bachelor’s degree, the
time plan can be translated into costs as indicated in Table 1. For the total of 300 effective
working hours, the total cost for development of the project is 4200 €.
Cost of Execution of Project
When executing the project and deploying an ML model for predictive quality in production,
there are costs for developing and operating the ML software. Depending on the size of the
company, estimated costs are listed in Table 2.
By summing up the development and execution costs, the total project cost is determined. In
the next step, the total cost is compared to the estimated benefits.
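For example, combining the development cost of 4200 € (Table 1) with the execution cost of 600.000 € estimated for a large company (Table 2) yields a total project cost of 604.200 €.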
(Figure 1 depicts the time plan from January to July 2021 for the four activities listed in Table 1.)
Table 1: Estimated costs for development of project

Activity | Effective hours | Associated cost
Analysis of current state and problem in practice | 30 h | 420 €
Theoretical fundamentals and evaluation of existing approaches | 90 h | 1260 €
Outline and development of methodology | 120 h | 1680 €
Validation and implementation of use case | 60 h | 840 €
Total | 300 h | 4200 €

Table 2: Estimated costs for execution of project

Cost factors | Small company | Large company
ML software development (~ 25 % of total cost) | 35.000 € | 150.000 €
ML software operation (~ 75 % of total cost) | 105.000 € | 450.000 €
Total | 140.000 € | 600.000 €

Sources: https://www.spheregen.com/cost-of-software-development/,
https://www.lookfar.com/blog/2016/10/21/software-maintenance-understanding-and-estimating-costs/
Benefit of Execution of Project
The value that can be created through the deployment of ML models for predictive quality
highly depends on the respective use case. A real-life case study from the production of
bladed disks (BLISKs), which are important components of turbines such as aircraft jet
engines, serves as an example. Predictive quality measures are estimated to bring the high
rework rate of 25 % down to 15 %. In the case study, annual savings in production of
27.000.000 € are estimated as shown in Table 3.
Table 3: Estimated savings in BLISK manufacturing

Cost reduction per item | Average number of items per day and factory | Annual savings per factory
3600 € | 40 | 27.000.000 €

Source: https://www.ericsson.com/en/reports-and-papers/consumerlab/reports/5g-business-value-to-industry-blisk
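As a rough plausibility check (the number of production days per year is not stated in the cited source and is therefore an assumption): 3600 € × 40 items correspond to 144.000 € of savings per day, so annual savings of 27.000.000 € imply on the order of 190 production days per factory and year.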
Advantageousness of Project
As an important note, this cost reduction is not achieved solely by the deployment. Deploying
a model only represents the last step and is preceded by activities such as preparing the production equipment and building the model itself. In order to realize the savings, the cost of
employing data scientists to build the model and necessary investments in devices for data
acquisition such as sensors must be taken into account. Consequently, the direct economic impact of the deployment is very difficult to isolate, even though the success of the predictive quality implementation depends on it.
In the presented example, comparing the magnitude of yearly savings of 27.000.000 € with the total cost of the project amounting to 604.200 € provides evidence for the advantageousness of the project. For other use cases, a detailed analysis of the costs of the whole life cycle of an ML application, including phases before the deployment, needs to be considered in the decision.
VII Appendix
Through workshops and the practical implementation, the methodology was validated. In this
appendix, the procedure applied in the workshops is presented and relevant elements of the
source code of the implemented software are explained.
A.1. Workshops
For the workshops with production experts, the tool Miro was used, which allows the participants to collaborate on a shared whiteboard in the browser. The participants'
input was collected by asking the questions below. Answers were written on the colored
notes and then posted directly to the corresponding section of the methodology.
A.2. Source Code
With respect to the source code, first the directory tree of the application folder is shown as
an overview. Then, the source code of the application’s most relevant components is
presented. The shown excerpts are responsible for enabling the main functionality of the
programmed service.
A.2.1. Directory Tree
The directory tree contains the folders with different functionalities. All elements responsible
for running the application locally on the computer and displaying it in the desired design in
the browser are located in the API folder. Data used for predictions, monitoring, and training are saved in the Data folder. Lastly, the ML_model folder contains all scripts and objects to
create a model and make predictions with it.
Application
| configuration.py
|
+---API
| | app.py
| |
| +---static
| | +---css
| | | style.css
| | |
| | \---images
| | favicon-16x16.png
| | favicon-32x32.png
| | favicon.ico
| |
| \---templates
| index.html
| layout.html
| monitoring.html
| prediction_input.html
| prediction_output.html
|
+---Data
| +---Production
| | 2008-01-08.csv
| | 2008-01-09.csv
| | …
| | 2008-12-09.csv
| | 2008-12-10.csv
| |
| \---Training
| uci-secom.csv
|
\---ML_model
| build_pipeline.py
| predict.py
| preprocessors.py
| train.py
|
\---Objects
constant_columns.pkl
mostly_empty_columns.pkl
pca.pkl
rfe.pkl
scaler.pkl
trained_model.pkl
trained_pipeline.pkl
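The configuration.py script at the root of the application is referenced by app.py and the ML model scripts but is not reproduced in this excerpt. A minimal sketch of its presumable content is shown below; the exact paths and the absence of further settings are assumptions derived from the directory tree above.
# configuration.py (illustrative sketch; exact contents are an assumption)
import os
# root directory of the application
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
# folder with the production data used for predictions
PRODUCTION_DATA_FOLDER = os.path.join(BASE_DIR, 'Data', 'Production')
# file with the training data
TRAINING_DATA_FILE = os.path.join(BASE_DIR, 'Data', 'Training', 'uci-secom.csv')
# folder in which the trained objects are stored
OBJECT_FOLDER = os.path.join(BASE_DIR, 'ML_model', 'Objects')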
A.2.2. API
app.py
The source code of the app.py script was already shown in chapter 7.2.2.
layout.html
In app.py, a request is answered by returning an HTML file. All HTML files are based on the
same template which is defined in layout.html. This file defines the look and feel of the
application by arranging and designing the shown objects.
<!DOCTYPE html>
<html>
<!-- Head -->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- title of web page -->
<title>SECOM Deployment</title>
<!-- icons -->
<link rel="icon" href="{{url_for('static', filename='images/favicon-32x32.png')}}" sizes=32x32>
<link rel="icon" href="{{url_for('static', filename='images/favicon-16x16.png')}}" sizes=16x16>
<!-- css stylesheet for design of elements -->
<link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
</head>
<!-- Body -->
<body>
<header class="header-basic">
<div class="header-limiter">
<!-- logo in the left corner can be used to get to the home page -->
<h1><a href="{{ url_for('.home') }}">SECOM Deployment</a></h1>
<!-- links to other endlinks on the right side of the page -->
<nav>
<a href="{{ url_for('.home') }}">Home</a>
<a href="{{ url_for('.get_prediction') }}">Prediction</a>
<a href="{{ url_for('.get_evaluation') }}">Monitoring</a>
<!--
<a href="#">tbd</a>
-->
</nav>
</div>
</header>
<div class="container">
{% block content %}
<!-- space for the content of html files using this layout-->
{% endblock %}
</div>
</body>
</html>
prediction_output.html
As an example of the HTML files, the code for displaying the prediction outputs is shown
below. It uses the layout template and shows the table of predictions made. By means of
JavaScript, the production fails are highlighted in red in the table.
{% extends "layout.html" %}
{% block content %}
<div class="menu">
<h1>Prediction > Results</h1>
<!-- show time and date of prediction -->
<p>{{pred_to_print}}</p>
<!-- display table with prediction results -->
<p>{{ table|safe }}</p>
</div>
<script>
// make reference to the table object
var table = document.getElementById('result_table');
// go through table and highlight fails
for (var r = 0, n = table.rows.length; r < n; r++) {
for (var c = 0, m = table.rows[r].cells.length; c < m; c++) {
if(table.rows[r].cells[c].innerHTML == "Fail")
{
table.rows[r].style.backgroundColor = "red";
}
}
}
</script>
{% endblock %}
A.2.3. ML Model
The shown code follows the workflow from training and pipeline building to prediction-making
and monitoring. First, the model is trained in train.py, then a pipeline is created in
build_pipeline.py, which is then used to make and monitor predictions in the predict.py script.
train.py
In order to train the model, a data scientist defined the necessary steps as an input for the
deployment. Thus, a large part of the code can be taken from the modeling phase. In train.py, the
defined steps are executed one after another and objects are exported to be used later on.
# imports
import os
import pandas as pd
import numpy as np
from math import sqrt
import joblib
# imports for sklearn functionality
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
# imports for data balancing
from imblearn.over_sampling import SMOTE, SMOTENC
from imblearn.under_sampling import RandomUnderSampler
# ignore warnings
import warnings
warnings.simplefilter(action='ignore')
# import setting from configuration
import configuration
from configuration import OBJECT_FOLDER as objectfolder

def get_data():
    # load training data file
    filepath = configuration.TRAINING_DATA_FILE
    return pd.read_csv(filepath)

def adjust_columns(data):
    # drop duplicates
    data.drop_duplicates(inplace=True, subset=["Time"])
    # set time stamp as index
    data.set_index(keys=["Time"], inplace=True)
    # drop mostly empty columns
    mostly_empty_columns = data.columns[data.isnull().mean() > 0.5]
    data.drop(mostly_empty_columns, axis=1, inplace=True)
    # interpolate
    data.interpolate(inplace=True)
    data.fillna(method='bfill', inplace=True)
    # drop constant features
    isConstant = data.nunique() == 1
    constantColumns = data.columns[isConstant]
    data.drop(constantColumns, axis=1, inplace=True)
    # export objects
    joblib.dump(mostly_empty_columns, os.path.join(objectfolder, 'mostly_empty_columns.pkl'))
    joblib.dump(constantColumns, os.path.join(objectfolder, 'constant_columns.pkl'))
    return data

def scale_data(data):
    # divide set into numeric and target
    data_numeric = data[data.columns[data.columns != 'Pass/Fail']]
    data_target = data[data.columns[data.columns == 'Pass/Fail']]
    # create a scaler, train scaler and transform data
    scaler = StandardScaler()
    scaled = scaler.fit_transform(data_numeric)
    # create a new DataFrame with the standardized data and with the original labels
    data_scaled = pd.DataFrame(data=scaled, columns=data_numeric.columns)
    # put back the non numeric variable
    data_target.reset_index(inplace=True)
    data_scaled['Pass/Fail'] = data_target['Pass/Fail']
    # export trained scaler
    joblib.dump(scaler, os.path.join(objectfolder, 'scaler.pkl'))
    return data_scaled

def reduce_dimension(data):
    # get the numerical data
    data_numeric = data[data.dtypes[data.dtypes == 'float64'].index]
VII Appendix 101
# Execute PCA so that 95% of variance are explained
pca = PCA(.95, random_state=42)
principal_components = pca.fit_transform(data_numeric)
data_principal = pd.DataFrame(data = principal_components)
# save output of PCA as array
x = np.array(data_principal)
y = np.array(data['Pass/Fail'])
# features are selected via linear regression
estimator = LinearRegression()
rfe = RFE(estimator)
selector = rfe.fit(x, y)
# reduce data frame to only the selected variables
selected_features = data_principal.columns[selector.support_]
# reduce variables
data_principal_reduced = data_principal[selected_features]
data_principal_reduced["Pass/Fail"]=data["Pass/Fail"]
# save PCA and RFE as objects for later
joblib.dump(pca, os.path.join(objectfolder, 'pca.pkl'))
joblib.dump(rfe, os.path.join(objectfolder, 'rfe.pkl'))
return data_principal_reduced
def undersample_data(data):
# train test split
train, test = train_test_split(data, test_size = 0.3, random_state=42)
# separate data set into features and target
X = data.loc[:, data.columns != 'Pass/Fail']
y = data.loc[:, data.columns == 'Pass/Fail']
# take majority class and reduce instances, the minority class is not chan
ged
rus = RandomUnderSampler(sampling_strategy='majority', random_state=42)
# execute resampling
X_rus, y_rus = rus.fit_resample(X, y)
#j oining features and target to one dataframe
y_rus.columns = ['Pass/Fail']
train_undersampled = X_rus.join(y_rus)
train_undersampled = train_undersampled.sample(frac=1).reset_index(drop=Tr
ue)
VII Appendix 102
# select randomly and scramble rows
train_undersampled = train_undersampled.append(train.sample(frac=1)[0:500]
, sort=False)
train_undersampled = train_undersampled.sample(frac=1).reset_index(drop=Tr
ue)
return train_undersampled
def fit_classifer(train):
# divide train and test set into X and y each
X_train = np.array(train.loc[:,train.columns !='Pass/Fail'])
y_train = np.array(train.loc[:,train.columns =='Pass/Fail'])
# create algorithm and train it
classifier = RandomForestClassifier(n_estimators = 500, max_depth = 20, ra
ndom_state = 42)
classifier.fit(X_train, y_train.ravel())
# export trained model
joblib.dump(classifier, os.path.join(objectfolder, 'trained_model.pkl'))
def execute_training():
print("Training started.")
# preprocessing steps
data = reduce_dimension(scale_data(adjust_columns(get_data())))
# undersampling of train set
train = undersample_data(data)
# fit model
fit_classifer(train)
# objetcs are saved and exported within functions
print("Training finished.")
# main method
if __name__ == '__main__':
execute_training()
build_pipeline.py
Based on the objects exported from training, a pipeline is built which contains all data
preprocessing steps and the trained ML algorithm. This pipeline is exported as an object for
the next step.
# imports
import os
import joblib
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin
from configuration import OBJECT_FOLDER as objectfolder
# import auxiliary methods necessary for pipeline
from ML_model import preprocessors as pp
# load objects from training
mostly_empty_columns = joblib.load(filename=os.path.join(objectfolder, "mostly_empty_columns.pkl"))
constant_columns = joblib.load(filename=os.path.join(objectfolder, "constant_columns.pkl"))
scaler_imported = joblib.load(filename=os.path.join(objectfolder, "scaler.pkl"))
pca_imported = joblib.load(filename=os.path.join(objectfolder, "pca.pkl"))
rfe_imported = joblib.load(filename=os.path.join(objectfolder, "rfe.pkl"))
model_imported = joblib.load(filename=os.path.join(objectfolder, "trained_model.pkl"))

# define steps of pipeline
pipeline = Pipeline(
    [
        ('remove_mostly_empty_columns', pp.RemoveMostlyEmptyColumns(variables_to_drop=mostly_empty_columns)),
        ('interpolate_missing_values', pp.InterpolateMissingValues()),
        ('remove_constant_features', pp.RemoveConstantFeatures(variables_to_drop=constant_columns)),
        ('Standard_Scaler', scaler_imported),
        ('PCA', pca_imported),
        ('RFE', rfe_imported),
        ('Random_Forest', model_imported)
    ]
)
# export pipeline
def dump_pipeline():
    joblib.dump(pipeline, filename=os.path.join(objectfolder, "trained_pipeline.pkl"))
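The pipeline above relies on custom transformer classes imported from ML_model/preprocessors.py,
which are not reproduced in this listing. As an illustration only, the following minimal sketch
shows how one of these transformers could be written so that it fits the scikit-learn pipeline
interface and mirrors the corresponding step in adjust_columns from train.py; it is an
assumption, not the author's original implementation.
# hypothetical sketch of a transformer such as RemoveMostlyEmptyColumns
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RemoveMostlyEmptyColumns(BaseEstimator, TransformerMixin):
    def __init__(self, variables_to_drop=None):
        # columns identified as mostly empty during training
        self.variables_to_drop = variables_to_drop

    def fit(self, X, y=None):
        # nothing is learned here, the columns to drop come from training
        return self

    def transform(self, X):
        # drop the stored columns from the incoming data frame
        return X.drop(columns=self.variables_to_drop)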
predict.py
For predictions, the trained pipeline is imported and a method is defined which returns
predictions based on a given data input. This method is called by app.py when the user
requests a prediction or evaluation metrics for monitoring purposes.
# imports
import os
import pandas as pd
import numpy as np
import joblib
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import configuration
from configuration import OBJECT_FOLDER as objectfolder

# import trained pipeline
trained_pipeline = joblib.load(os.path.join(objectfolder, "trained_pipeline.pkl"))

def get_prediction_df(input_data):
    # convert input data into data frame
    df = pd.DataFrame(input_data)
    # set time stamp as index
    df.set_index("Time", inplace=True)
    df = df.astype(np.float32)
    # save time stamps for traceability
    time_stamps = df.index.tolist()
    # create product IDs
    product_IDs = []
    for time_stamp in time_stamps:
        id_part1 = time_stamp[2:4]
        id_part2 = time_stamp[5:7]
        id_part3 = time_stamp[8:10]
        id_part4 = "{0:0=4d}".format(time_stamps.index(time_stamp))
        id_complete = str(id_part1) + str(id_part2) + str(id_part3) + "_" + str(id_part4)
        product_IDs.append(id_complete)
    # get prediction from pipeline
    predictions = trained_pipeline.predict(df)
    # save predictions with additional information
    df_results = pd.DataFrame(
        {'Time Stamp': time_stamps,
         'Prediction': predictions,
         'Product ID': product_IDs
         })
    # replace numerical values by human-readable ones
    df_results["Prediction"].replace(to_replace=-1, value="Pass", inplace=True)
    df_results["Prediction"].replace(to_replace=1, value="Fail", inplace=True)
    return df_results

def get_metrics_scores():
    # load holdout data set with target variable
    data = pd.read_csv(configuration.HOLDOUT_DATA_FILE)
    # set Time as index
    data = data.set_index("Time", inplace=False)
    # save correct labels
    y = data["Pass/Fail"]
    # drop label as the model requires unlabeled data
    X = data.drop('Pass/Fail', axis=1, inplace=False)
    # execute prediction
    y_pred = trained_pipeline.predict(X)
    # calculate accuracy, precision, recall and F1 score
    scores = np.array([accuracy_score(y, y_pred), precision_score(y, y_pred),
                       recall_score(y, y_pred), f1_score(y, y_pred)])
    # round values to 4 decimals
    scores_rounded = np.around(scores, decimals=4)
    # return rounded scores
    return scores_rounded

def get_version_number():
    # return current version of pipeline
    return configuration.VERSION_NUMBER
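The following lines only sketch how these methods could be called, for example from app.py or
a small test script; the file path and the import path are assumptions made for illustration.
# hypothetical usage sketch, not part of the original scripts
import pandas as pd
from ML_model import predict  # assuming predict.py is part of the ML_model package

# file name and location are assumptions for illustration
input_data = pd.read_csv("data/2008-10-15.csv")
# request predictions and print the first results
df_results = predict.get_prediction_df(input_data)
print(df_results.head())
print("Pipeline version:", predict.get_version_number())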
A.2.4. Data
With regard to data, all data is ingested in the form of CSV files. The example below shows
comma-separated values from the sensor data of one specific day, which are used as an input
for prediction-making. Each line contains the 590 sensor values for one instance of data.
2008-10-15.csv