MASTER THESIS IN ADVANCED ENGINEERING IN
PRODUCTION, LOGISTICS AND SUPPLY CHAIN MANAGEMENT
DEPLOYMENT OF MACHINE LEARNING
MODELS FOR PREDICTIVE QUALITY IN
PRODUCTION
AUTHOR: HENRIK HEYMANN
SUPERVISOR: ANDRES BOZA GARCIA
EXTERNAL SUPERVISOR: MAIK FRYE
Academic year: 2020-21
Abstract
Assuring production quality is a key element of manufacturing, especially in highly
developed countries. At the same time, machine learning is an emerging research topic.
An important area of application for machine learning is the prediction of quality
in production. In practice, deploying a performant model from the development stage into real-world
production often fails due to the lack of a clearly structured
methodology that covers the whole end-to-end process as well as the necessary decisions and
steps in detail.
This thesis aims to provide a methodology for machine learning model deployment applied to
the context of predictive quality based on data collected during the production process. To
facilitate predictive quality under consideration of the company’s specific needs and
restrictions, the methodology shall serve as a guideline during the selection process of the
most adequate deployment option.
In order to achieve this goal, a review of academic and gray literature identifies available
options and concepts for deployment. Based on the review, a methodology that analyzes
and structures the possible solutions is developed. For validation purposes, the methodology
is discussed with experts and a use case of a machine learning model from a real-world
manufacturing process is implemented.
The developed methodology provides a clear structure and gives an overview of the decisions
to be made and the tasks to be carried out for the deployment of machine learning models for predictive
quality in production. Further research could examine individual phases of the
methodology in greater depth, such as the implementation from a software engineering perspective.
Keywords: Machine Learning; Predictive Quality; Production Quality
Resumen
Assuring production quality is one of the key elements of manufacturing, especially in highly
developed countries. At the same time, machine learning is an emerging research topic. An
important area of application for machine learning is the prediction of quality in production. In
practice, the deployment of a high-performing model from the development stage into real-world
production often fails due to the lack of a clearly structured methodology that covers the whole
end-to-end process as well as the necessary decisions and steps in detail.
This master's thesis aims to provide a methodology for the deployment of machine learning
models applied to the context of quality prediction based on data collected during the production
process. To facilitate predictive quality while taking into account the company's specific needs
and restrictions, the methodology shall serve as a guideline during the selection process of the
most adequate deployment option.
To achieve this goal, a review of academic and gray literature identifies the available options
and concepts for deployment. Based on the review, a methodology that analyzes and structures
the possible solutions is developed. To validate the methodology, it is discussed with experts
and a use case of a machine learning model from a real-world manufacturing process is
implemented.
The developed methodology provides a clear structure and offers an overview of the decisions
and tasks required for the deployment of machine learning models for quality prediction in
production. Future research could go deeper into individual phases of the methodology, such as
the implementation with a software engineering focus.
Keywords: machine learning; quality prediction; production quality
I Table of Contents
I Table of Contents ......................................................................................................... i
II Abbreviations ...............................................................................................................iv
III List of Figures ...............................................................................................................v
IV List of Tables .............................................................................................................viii
1 Introduction ...................................................................................................................1
1.1 Initial Situation and Motivation .................................................................................1
1.2 Objective .................................................................................................................2
1.3 Structure .................................................................................................................2
2 Problem in Practice ......................................................................................................4
2.1 Current State of Deploying ML Models in Practice ..................................................4
2.2 Main Challenges for Deployment of ML Models in Practice .....................................4
2.3 Need for Investigation .............................................................................................6
3 Theoretical Fundamentals ...........................................................................................7
3.1 Quality Management ...............................................................................................7
3.1.1 Predictive Quality ............................................................................................7
3.1.2 Exemplary Use Cases in Practice ...................................................................8
3.2 Machine Learning (ML) ...........................................................................................9
3.2.1 Definition .........................................................................................................9
3.2.2 Life Cycle of ML Projects ...............................................................................11
3.2.3 Data Preparation and Modeling .....................................................................12
3.2.4 Evaluation and Deployment ...........................................................................16
3.3 Software Engineering ............................................................................................18
3.3.1 Traditional Software ......................................................................................18
3.3.2 ML Software ..................................................................................................21
4 State of the Art ............................................................................................................22
4.1 Definition of Evaluation Criteria .............................................................................22
4.2 Literature Review ..................................................................................................24
4.2.1 Step 1: Accumulation and Selection of Publications ......................................24
4.2.2 Step 2: Categorization of Publications ...........................................................28
4.2.3 Step 3: Evaluation of Publications .................................................................29
4.3 Most Relevant Approaches ...................................................................................33
4.4 Theory Deficit ........................................................................................................39
5 Outline of the Methodology........................................................................................40
5.1 Requirements........................................................................................................40
5.1.1 Content Requirements ..................................................................................40
5.1.2 Formal Requirements ....................................................................................40
5.2 Scope ...................................................................................................................41
5.3 Reference Framework ...........................................................................................42
6 Development of the Methodology..............................................................................44
6.1 Deployment Design ...............................................................................................45
6.1.1 Pre-considerations: Design Requirements.....................................................45
6.1.2 Architecture Patterns .....................................................................................49
6.2 Productionizing & Testing .....................................................................................52
6.2.1 Pre-considerations: Environments .................................................................52
6.2.2 Implementation Steps ....................................................................................53
6.3 Monitoring .............................................................................................................60
6.3.1 Pre-considerations: ML Model Decay ............................................................60
6.3.2 Monitoring Levels ..........................................................................................61
6.4 Retraining .............................................................................................................63
6.4.1 Pre-considerations: Retraining Effect ............................................................64
6.4.2 Retraining Decisions .....................................................................................64
6.5 General Aspects for Deployment...........................................................................65
6.5.1 Roles and Competencies ..............................................................................65
6.5.2 Tools and Frameworks ..................................................................................67
7 Verification and Validation .........................................................................................69
7.1 Verification ............................................................................................................69
7.2 Validation ..............................................................................................................70
7.2.1 Expert Interviews ...........................................................................................70
7.2.2 Practical Application ......................................................................................71
8 Conclusion ..................................................................................................................77
V Bibliography ................................................................................................................79
VI Budgeting ....................................................................................................................91
VII Appendix .....................................................................................................................94
II Abbreviations
Abbreviation Description
AI Artificial Intelligence
CD4ML Continuous Delivery for Machine Learning
CI/CD Continuous Integration/ Continuous Delivery
CPU Central Processing Unit
CRISP-DM Cross Industry Standard Process for Data Mining
GPU Graphics Processing Unit
IoT Internet of Things
IT Information Technology
KDD Knowledge Discovery in Databases
ML Machine Learning
OS Operating System
QA Quality Assurance
SEMMA Sample, Explore, Modify, Model and Assess
III List of Figures
Figure 1.1: Structure of the work based on applied research according to Ulrich ....................2
Figure 3.1: AI, ML, deep learning and data science based on Kotu and Deshpande (2019,
p. 3) ................................................................................................................................9
Figure 3.2: Traditional program and machine learning (Kotu & Deshpande, 2019, p. 3) .......10
Figure 3.3: CRISP-DM (Chapman et al., 2000, p. 10) ...........................................................12
Figure 3.4: Binary and multiclass classification, regression and clustering (Singh, 2021,
pp. 8–10) ......................................................................................................................14
Figure 3.5: Machine learning types according to en.proft.me (2015).....................................14
Figure 3.6: Visualization of reinforcement learning (Singh, 2021, pp. 12–13)........................15
Figure 3.7: Typical Steps in an ML Pipeline based on Galli (2020) .......................................18
Figure 3.8: Implementation steps to develop a large computer program (Royce, 1970) ........19
Figure 3.9: Software development life cycle according to bigwater.consulting ......................19
Figure 3.10: Scrum framework according to Scrum.org ........................................................20
Figure 3.11: DevOps approach according to Harlann (2017) ................................................20
Figure 3.12: Continuous integration, delivery and deployment according to Pennington
(2019) ...........................................................................................................................21
Figure 3.13: MLOps (Neal Analytics, 2020) ..........................................................................21
Figure 4.1: Connected papers to Sculley et al. (2015) from connectedpapers.com ..............27
Figure 4.2: Year distribution of selected publications (Total of 46) ........................................28
Figure 4.3: Type distribution of selected publications (Total of 46)........................................28
Figure 4.4: Illustration of the process chain (Krauß et al., 2020) ...........................................33
Figure 4.5: Predictive model-based quality inspection framework (J. Schmitt et al., 2020) ...34
Figure 4.6: ML Code as small fraction of ML systems (Sculley et al., 2015) .........................34
Figure 4.7: Traditional system and ML-based system testing and monitoring (Breck et al.,
2017) ............................................................................................................................35
Figure 4.8: Continuous delivery for ML end-to-end process ..................................................36
Figure 5.1: AutoML pipeline in the context of production based on Krauß ............................43
Figure 6.1: Overview of methodology ...................................................................................44
Figure 6.2: Morphological box for deployment design ...........................................................46
Figure 6.3: Cloud service levels based on Watts and Raza (2019) and Chen (2020) ...........49
Figure 6.4: Common architecture patterns in practice ...........................................................50
Figure 6.5: Environments for ML model development and ML software development...........53
Figure 6.6: Sequence of implementation steps .....................................................................53
Figure 6.7: GitHub vs GitLab workflow (GitLab, 2021) ..........................................................55
Figure 6.8: Procedural programming vs pipeline structure ....................................................56
Figure 6.9: Bare metal, virtual machines, and containers based on Kominos et al. (2017)....57
Figure 6.10: Testing pyramid ................................................................................................58
Figure 6.11: Degrees of Automation based on Chigira (2019) ..............................................60
Figure 6.12: Levels of monitoring..........................................................................................61
Figure 6.13: Analysis flow chart ............................................................................................63
Figure 6.14: Impact of refreshing on model quality based on Thomas and Mewald (2019) ...64
Figure 6.15: Retraining decisions .........................................................................................64
Figure 6.16: Collaboration between process, data science and DevOps competence ..........66
Figure 6.17: Maturity model dimensions based on Hornick (2018)........................................66
Figure 7.1: Webservice architecture for use case .................................................................72
Figure 7.2: Screenshot of home page of the webservice ......................................................72
Figure 7.3: Screenshot of prediction input ............................................................................73
Figure 7.4: Screenshot of prediction output ..........................................................................73
Figure 7.5: Screenshot of monitoring ....................................................................................74
IV List of Tables
Table 3.1: Definitions and terminology (Mohri et al., 2018, pp. 4–5) .....................................13
Table 3.2: Confusion matrix (Harrington, 2012, p. 144) ........................................................16
Table 3.3: Classification metrics with formula (Flach, 2012, pp. 53–61; Harrington, 2012,
p. 24; Sarkar et al., 2018, p. 12) ....................................................................................17
Table 4.1: Evaluation criteria ................................................................................................22
Table 4.2: Search strings ......................................................................................................25
Table 4.3: Evaluation of existing approaches .......................................................................30
Table 4.4: Four potential ML system architecture approaches ..............................................37
Table 6.1: Evaluation of architectures ...................................................................................52
Table 6.2: Tests according to Sato et al. (2019) ...................................................................58
Table 6.3: Pros and cons of open-source and closed-source tools (Matteson, 2018) ...........67
Table 7.1: Evaluation of developed methodology .................................................................70
1 Introduction
According to Andrew Ng, an adjunct professor of computer science at Stanford University
and one of the leading personalities in artificial intelligence (AI), there is still enormous
potential to be exploited in many economic sectors (Johnson, 2019):
“I think the next massive wave of value creation will be when you can get a
manufacturing company or agriculture devices company or a health care
company to develop dozens of AI solutions to help their businesses.”
In this introductory chapter, the initial situation and motivation for implementing these
solutions in manufacturing companies are presented. Furthermore, the objective is
formulated followed by the description of the structure of this thesis.
1.1 Initial Situation and Motivation
Machine learning (ML), as one subdomain of AI, is an emerging research topic
(Perrault et al., 2019). The main reasons for the growth in the adoption of ML and AI by
businesses are the rise in data, increased computational efficiency, improved ML algorithms,
and the availability of data scientists (Singh, 2021, p. 3). With regard to the manufacturing sector,
ML is a useful technique to predict quality in production. Companies from
industrialized nations must be able to manufacture products of superior quality at competitive
costs to ensure their competitiveness in a globalized world (National Research Council,
1995, p. 1). Initial applications show that the value of using ML models for quality
prediction has already been recognized (Brosset et al., 2019).
Singh (2021, p. 53) states that ML and AI “are not a silver bullet that can solve all problems”.
The author of this statement refers to the fact that the implementation of ML does not work
without significant investments, including the necessary effort associated with the
deployment. This very step of integrating the model into the running process is a crucial
factor for the success of an ML project, as a true benefit for a company is only generated by
making the predictions available to the appropriate users in production (Odegua, 2020).
Currently, transferring a performant ML model from the development stage into real-world
production is often conducted in an unstructured and unsubstantiated manner, which cannot
ensure that the best solution is delivered for every specific situation. This is caused by the huge
variety of different concepts available for deployment. Due to the field’s novelty and dynamism,
capturing the whole landscape of offered tools proves extremely difficult (Turck, 2020).
Complicating matters further, decision owners in the top management of companies tend to
lack in-depth knowledge of data science and related fields such as software engineering
(Salminen et al., 2017). Consequently, there is a lack of a structured and well-founded procedure for
deploying ML models that describes the deployment process in breadth and
depth and assists with the implementation of an ML solution from start to finish.
1.2 Objective
For these reasons, this thesis aims to provide a methodology for ML model deployment
applied to the context of predictive quality in production. Within this field of application, the
focus is set on ML models which ingest tabular data from manufacturing processes to make
inferences on the output quality. To facilitate predictive quality under consideration of the
company’s specific needs and restrictions, the methodology shall serve as a guideline during
the selection process of the most adequate deployment option.
Therefore, the following research question can be formulated based on the defined goal and
is to be answered in this thesis:
How to deploy ML models for predictive quality in production?
1.3 Structure
Structurally, this work is based on the concept for applied research in theory and practice by
Ulrich et al. (1984, p. 193). Figure 1.1 links the seven phases of the applied research
methodology with the eight resulting chapters of this work.
Figure 1.1: Structure of the work based on applied research according to Ulrich
In chapter 1, the relevance of the topic is set out, the objective is defined, and the
overarching research question is established. This introduction is followed by chapter 2 with
a more detailed description of the problem and its corresponding challenges encountered in
practice. These first two chapters serve to identify relevant problems from practice.
Problem-relevant theories are identified in chapter 3 and 4. On the one hand, chapter 3 lays
the theoretical foundation for the further course of this work by introducing fundamental
concepts of quality management, ML, and software engineering. Chapter 4, on the other
hand, evaluates existing approaches by means of an analysis of the state of the art.
Subsequently, the methodology is outlined in chapter 5 by defining the requirements, the
scope, and the reference framework. Both the state of the art and the outline of the methodology
serve to collect problem-relevant procedures from the formal sciences and to
capture the application context. After defining the specifications of the concept, the
subsequent chapter 6 is devoted to elaborating the methodology for deploying ML models for
predictive quality in production. Both chapters 5 and 6 comprise the derivation of assessment
criteria, design rules, and models for the methodology.
In chapter 7, the generated solution approach to answer the research question is verified and
validated including the practical implementation in order to test the model in the context of
application. Finally, chapter 8 summarizes and critically reflects on the results with regard to their
implications for consulting and implementation in practice before providing a short outlook on
further research.
2 Problem in Practice
This chapter adds more detail to the introductory motivation by analyzing the current state of
the deployment of ML models in production. Furthermore, the challenges companies face
when deploying ML models for practical applications
are addressed. In the following, ML and AI solutions are analyzed as one combined topic.
2.1 Current State of Deploying ML Models in Practice
Studies deliver evidence about the actual situation of deployments of AI applications in
companies. Underlying all surveys is the high level of interest in capitalizing on AI, which is
expected to change businesses fundamentally by contributing up to $15.7 trillion to the global
economy by the year 2030 (Rao et al., 2019).
Various studies on ML model deployment in practice conclude that
many ML projects never get deployed and only a very small percentage of ML models make
it to production (Gonfalonieri, 2019). Enterprises are discovering that it is easier to build a model
than to integrate it into existing processes, indicating that even if the model development
phase was successful, the most difficult part is yet to come. As a result of these so-called
last-mile deployment problems, most companies deploy only between 10 % and 40 % of their
ML projects, depending on their size and technology readiness (Lawton, 2020). Of all
pursued projects, 78 % are shut down before reaching the deployment stage (Singh Bisen,
2019). Further sources report that 87 % of data science projects never make it into
production (Larsen, 2019). Even without trying to find the most accurate percentage of
successfully deployed ML models, it becomes evident that deployment is executed
insufficiently in most cases.
Resources are wasted on unsuccessful deployment efforts, which include not deploying ML
models at all or failing to bring them properly into use. In either case, time and money are
spent without ever gaining any benefit from using the model. Focusing on surveyed companies
which did manage to deploy a model, about half of them say they spend between 8 and
90 days deploying one model. 18 % of companies take longer than 90 days, with some
spending more than a year productionizing. Translated into actual working time, data scientists
spend at least 25 % of their time deploying models (Algorithmia,
2019). As a takeaway, any company wanting to reap the potential of ML needs to prioritize
carrying out the deployment efficiently.
2.2 Main Challenges for Deployment of ML Models in Practice
The evidence shows that failed projects are not the exception but the usual case in
practice, so it is necessary to analyze the underlying causes of the problem. Reasons for
ineffective deployments can have an organizational and/or technological origin (Baier et al.,
2019). A selection of the most critical factors is explained in the following.
High Set-Up and Operation Effort
Setting up an ML system imposes technological challenges such as CPU and memory
usage, scalability, portability and traceability (Gonfalonieri, 2019; Shaik, 2019). Companies
need IT infrastructure that can maintain high availability in order to accommodate spikes in
demand for the ML model (Decosmo, 2019). All these factors, among many others, need to
be taken into account when selecting platforms and tools (Druzkowski, 2017). The range of
offered services comprises standardized programs by big established technology companies
as well as specialized tools by start-ups. Due to the dynamics of the market, the number of
different offerings will increase even more over time (Turck, 2020).
Setting up the system does not only concern technologies, but also the integration of ML
models into the business application (Shaik, 2019). The biggest AI deployment impediment
for most companies consists of providing the infrastructure for connecting AI to the
business. Only if AI is adopted all the way down to the end user does it unfold its real business
value (Lawton, 2020). Business users need to understand and trust ML models when using
predictions in their decision-making process; therefore, a model should be developed for one
specific task to enhance an existing process or solve a well-defined business problem. In
doing so, the idea is to keep it simple and not expect too much too quickly (Decosmo, 2019).
In comparison to the operation of regular software, ML applications require more frequent
updates and must be monitored continuously (Shaik, 2019). Monitoring includes
observing the ML model’s performance and watching out for gaps between training and real-
world data. An ML model’s accuracy is at its best only until it starts being used. It is hard to build
ML models that reflect future, unseen behavior if that behavior evolves quickly. A deployed
model interacts with the real world and, thus, changes in the real world can break the features
a model depends on or can make the prior distributions a model was trained on obsolete
(Talby, 2019).
Missing Coordination and Support
From a purely organizational point of view, the main challenge is the missing alignment
between roles. There might be a lack of understanding of the business problem by the
analyst, or the models may be too complex to implement (Shrivastava, 2016). Therefore, it is
necessary to build diverse teams that include people with business, IT and specialized
AI skills (Rao et al., 2019). A team should have both practical software development and
model-building experience, as many data scientists only have academic experience in
building ML models and lack practical experience in deploying them (Decosmo, 2019). In
order to leverage AI projects successfully, coordination within the team but also across
hierarchies needs to be considered. A project’s success depends on the leadership support
by business leaders and the communication with the decision owners (Larsen, 2019).
2.3 Need for Investigation
All the previously described challenges are already difficult enough, yet companies still tend
to underestimate the deployment and maintenance of ML solutions (Shaik, 2019). As an
interim conclusion, the challenging task of deploying ML models cannot be considered in
isolation, as there are many different relevant factors influencing the process. Not only
the technological requisites of the specific use case, but also organizational and structural
components within the company can have a substantial impact. Evidence from studies
shows the increasing relevance of the topic and reveals the existing deficit in practice, which
serves as a starting point for developing a methodology. The identified challenges are
addressed and picked up in the development phase.
3 Theoretical Fundamentals
In this chapter, basic theoretical concepts regarding quality management, machine learning
and software engineering are presented. These theoretical foundations do not yet provide
solutions to the formulated problem but give a contextual understanding of deploying ML
models for predictive quality.
3.1 Quality Management
First, basic concepts from quality management are presented in order to define predictive
quality and show exemplary applications from practice.
3.1.1 Predictive Quality
In general, data has become more and more important in the field of quality and is used to
make decisions for different types of quality-related tasks. In reference to Industry 4.0, the
term “Quality 4.0” describes the systematic and goal-oriented usage of all available data to
improve quality. Approaches for data-based quality regulation use data mining and
ML methods to optimize the processes in order to achieve a demanded product quality (Ngo
& Schmitt, 2016). Predictive quality is defined as “the empowerment of the user to optimize
product and process-related quality by using data-driven predictions as a basis for decision-
making and action” (R. H. Schmitt et al., 2020).
A terminological distinction between predictive quality and predictive maintenance is
necessary. Predictive maintenance is defined as regular monitoring of the operating
condition of production equipment and aims to ensure the maximum interval between repairs
and minimize the number and cost of unplanned interruptions in production. Process-related
indicators are captured to determine the actual operating condition of critical plant systems
and to schedule maintenance activities according to the obtained data. Successfully
executed predictive maintenance activities improve product quality, productivity, operating
efficiency and profitability of production plants (Mobley, 2002, pp. 4–6).
Predictive quality can enable various potentials in practice ranging from the analysis of past
defects to the prediction of future events and the derivation of remedial measures. This can
be achieved through the use of simple statistical methods or complex ML models. The need
for a data-driven predictive approach is caused by the increasing complexity in production
processes, the rising number of inherent interactions between individual processes and a
significant increase in process variance due to the increasing individualization of products.
As a main objective, adequate measures shall be derived from the prediction to optimize the
quality (R. H. Schmitt et al., 2020).
For the purpose of this thesis, predictive quality is defined as the activity of making
predictions about the quality of a product. In contrast, ensuring aspects such as stable
processes, efficient process chains and the fulfillment of requirements by products fall under
the term of production quality.
3.1.2 Exemplary Use Cases in Practice
The following use cases show examples from real-life production, where either a generic
form of data analysis or specifically ML methods are used to predict the product quality.
Rejects Forecasting in Production Chains
Wasting resources on rejected products gets more expensive along the production chain of
lamps for automotive lighting and LED components. To reduce the reject rate in the last
manufacturing step, a forecasting model for predicting rejects is trained on manufacturing
parameters and inspection data, with the help of which the main influencing variables can be
identified and initial recommendations for action can be derived (R. H. Schmitt et al., 2020).
ML to Predict Product Quality and Geometry in Circular Laser Grooving Process
In the process of circular laser grooving, achieving the desired micro grooves on the
circumference of cylindrical parts depends on the appropriate selection of process
parameters such as workpiece rotational speed or laser power and frequency. A random
forest algorithm is used to derive the most influential input parameters on the outputs with
respect to product quality (Zahrani et al., 2020).
Laser Cutting Process
A laser cutting process is made up of two parallel running sub-processes: guiding the laser
beam and simultaneously moving the work piece. The final cutting shape results, which
define the product quality, do not only depend on the two sub-processes but can also be
influenced by a previous grinding process. As there is no simple, direct chain of effect between
the processes, data mining methods are applied to analyze the complex interdependencies
between the processes (Ngo & Schmitt, 2016).
Quality Improvement of Milling Process
Forecasting vibrations during milling of components for the aerospace industry is achieved
through the implementation of ML algorithms that predict critical process conditions. Very
high requirements for the product quality with regards to surface roughness or dimensional
deviations must be met and can be accomplished by adjusting machining parameters in
advance to avoid critical conditions of the process (Frye & Schmitt, 2019).
Preventive Quality Assurance in Clothing Industry
ML-based systems are not only used in the production process of metal parts to predict
product quality, but also find application in the production of other goods such as textiles. In
cooperation with a German clothes manufacturer, an algorithm was trained for the purpose of
preventive quality assurance which automatically feeds insights about failure rates into the
design process, without the need for any manual data analysis (Nalbach et al., 2018).
In this thesis, predictive quality for a generic production process is considered. On the basis
of measured values, an ML model shall predict if a product will pass the quality control at a
certain stage in production by analyzing the sensor data from the production process itself.
Therefore, each product is to be categorized into classes such as “pass” and “fail” or “ok” and
“not ok”. For a higher level of detail, a model might categorize items in even more than two
classes, e.g., into three groups such as “parts ok”, “rework needed”, “scrap”. As described
before, applications for predictive maintenance and production quality might work similarly to
predictive quality but still need to be distinguished as they all fulfill different tasks and pose
different requirements.
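To make this framing concrete, the following minimal sketch shows how such a pass/fail prediction could be expressed in Python with scikit-learn; the sensor readings, feature names and the choice of a decision tree are purely illustrative assumptions and are not prescribed by this thesis.

```python
# Illustrative sketch only: predictive quality framed as binary classification.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical process data: one row per produced part with its sensor readings
# and the result of the downstream quality control ("ok" / "not ok").
data = pd.DataFrame({
    "spindle_speed_rpm": [1200, 1180, 1250, 1300, 1190, 1260],
    "feed_rate_mm_min":  [300, 310, 295, 280, 305, 290],
    "coolant_temp_c":    [21.5, 22.0, 23.1, 24.8, 21.9, 25.2],
    "quality_result":    ["ok", "ok", "ok", "not ok", "ok", "not ok"],
})

X = data.drop(columns=["quality_result"])  # features: measured process values
y = data["quality_result"]                 # label: outcome of the quality control

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Predict the expected inspection result for a new, unseen part.
new_part = pd.DataFrame([{"spindle_speed_rpm": 1240,
                          "feed_rate_mm_min": 298,
                          "coolant_temp_c": 24.9}])
print(model.predict(new_part))             # -> ["ok"] or ["not ok"]
```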
3.2 Machine Learning (ML)
Machine Learning (ML) is a huge field of research with many authors covering the topic
in depth. For the purpose of deploying ML models rather than building them, it is necessary to
achieve a basic understanding of what ML is and how ML models work.
3.2.1 Definition
First of all, ML must be defined and differentiated from other fields of study. ML frequently
appears in connection with Artificial Intelligence (AI), Deep Learning and Data Science. AI
serves as an umbrella term for ML and Deep Learning, as visualized in Figure 3.1, and is
defined as giving computers the capability of mimicking human behavior, particularly
cognitive functions. Techniques such as robotics, synthetic language and cognitive vision
pertain to AI (Kotu & Deshpande, 2019, pp. 2–3).
Figure 3.1: AI, ML, deep learning and data science based on Kotu and Deshpande (2019, p. 3)
ML as a sub-field or tool of AI covers techniques which give computers the ability to learn
from experience in form of data without being explicitly programmed to do so (Kotu &
Deshpande, 2019, pp. 2–3). The difference between traditional and ML programs is depicted
in Figure 3.2.
Figure 3.2: Traditional program and machine learning (Kotu & Deshpande, 2019, p. 3)
While traditional programming is rule-based, ML aims to learn inherent patterns (Sarkar et
al., 2018, pp. 5–7). The automated detection of meaningful patterns in data is referred to as
learning (Shalev-Shwartz & Ben-David, 2019, p. vii). The definition of learning goes beyond
memorizing past data and describes converting experience into expertise. Gained knowledge
enables broader generalization which means to apply expertise gained from known
examples to unseen data in order to make a prediction (Shalev-Shwartz & Ben-David, 2019,
pp. 19–20). Learning in ML, unlike in psychology, cognitive science, or neuroscience, does not
aim to understand the learning processes in humans and animals but aims to build a useful
system (Alpaydin, 2014, p. 14). As a condensation of the fairly technical definition originally
formulated in 1997 by Mitchell, ML describes learning from experience. This experience is
the input data for the learning algorithm (Shalev-Shwartz & Ben-David, 2019, pp. 19–20).
Algorithms are computational methods using experience to improve performance or to make
accurate predictions (Mohri et al., 2018, p. 1).
As indicated in Figure 3.1, ML and data science show an overlap. ML and data science
methods are used to extract value from data (Harrington, 2012, p. 5; Kotu & Deshpande,
2019, p. 3). What ML has in common with the fields of statistics, operations research and
management information systems is the goal to make data-driven decisions, with the
difference that these fields do not consider reasoning or intuition (Sarkar et al., 2018, pp. 4–
5). Among others, further connected fields are mathematics, data mining and computer
science (Singh Bisen, 2019, p. 13). Deep Learning describes a subset of ML which makes
the computation of multi-layer neural networks feasible (Jeffcock, 2018). Regarding their
operating principle, neural networks deviate strongly from common ML algorithms and, thus,
form a separate category.
Primarily, ML aims to gain insight from data (Harrington, 2012, p. 5). In addition to
understanding available data, its goal is to generate accurate predictions for unseen items by
designing efficient and robust algorithms to produce these predictions even for large-scale
problems (Mohri et al., 2018, p. 3). This is fueled by the need to make data-driven
decisions at scale (Sarkar et al., 2018, p. 4).
Implementing ML is especially beneficial when there is no exact model available and useful
approximations can only be made based on existing and accessible data (Alpaydin, 2014,
pp. 1–2). Furthermore, domain specific problems with a lack of human expertise or problems
at scale with huge volumes of data with too many complex conditions and constraints are
predestined for the utilization of learning methods. ML is also suitable for environments with
continuously changing behavior as well as for conditions where formally explaining or
translating human expertise into computational tasks, e.g. speech recognition, proves to be
difficult (Sarkar et al., 2018, p. 9).
Real-life applications of ML extend across many fields such as retail, finance and
manufacturing (Alpaydin, 2014, p. 3). In retail, possible use cases for ML include
personalized recommendations of products in online shops. Stock market forecasting and
fraud detection are examples from finance. When looking at use cases
from manufacturing, ML can be used to detect failures and defects in production (Sarkar et
al., 2018, p. 65). In the sector of automobile manufacturing, a common use case consists in
predictive maintenance (Singh, 2021, pp. 48–52). AI solutions not only find application in the
manufacturing process itself, but also in the supply chain planning (Rodríguez et al., 2020).
3.2.2 Life Cycle of ML Projects
There are different frameworks available for managing the life cycle of ML projects.
Knowledge discovery in databases, known as the KDD process model, covers the steps
selection, preprocessing, transformation, data mining and interpretation/ evaluation. A
second framework is SEMMA which is an acronym that stands for the steps sample, explore,
modify, model, and assess. Finally, the so-called cross industry standard process for data
mining (CRISP-DM) provides a standardized and generally valid procedure (Azevedo, 2008).
Approaches for KDD and data mining can be applied to structure ML projects as these terms
are used in computer science to describe the same methods which are used for ML
(Alpaydin, 2014, pp. 3, 16).
Out of the introduced process models, CRISP-DM is still the most popular framework for
executing data science projects (Saltz, 2020). It is composed of the steps business
understanding, data understanding, data preparation, modeling, evaluation and deployment
(see Figure 3.3). The arrows in the graphic illustrate that each phase must not be analyzed in
an isolated manner due to the dependencies between phases and the methodology’s cyclical
nature.
Figure 3.3: CRISP-DM (Chapman et al., 2000, p. 10)
Business understanding focuses on converting business requirements into a specific
problem definition. The data understanding includes gaining insights into the available data
which are necessary for the subsequent data preparation phase during which adequate
transformations on the raw data are executed to obtain the final dataset. In the modeling
phase, different techniques and parameters are applied and tested to create the best
possible model. Once the model is built, an evaluation assesses if the objectives are met.
Finally, deployment describes the integration of the ML model into an organization’s
decision-making processes.
3.2.3 Data Preparation and Modeling
The steps of business understanding and data understanding do not require specific ML
knowledge but depend highly on the use case. However, for data preparation and modeling,
a technical understanding of ML is necessary. Both are closely related activities in the
ML life cycle. Table 3.1 summarizes relevant definitions and terminology for the described
steps.
Each instance of data can be described through a set of attributes, the features. In other
words, features form an instance (Harrington, 2012, p. 8). This means that every instance of
data can be viewed as a vector of feature values. If the input data does not come with built-in
features, they need to be constructed by the developer of the ML application. Feature
construction might be necessary for use cases in which instances come without any attributes.
Any kind of adjustment of existing features in order to achieve better results by a learning
algorithm is included in feature transformation (Flach, 2012, pp. 38–46). Two approaches to
transform features are feature selection and feature extraction. Feature selection methods
select or discard features to reduce the overall number of features. Feature extraction
methods engineer new features from the existing ones (Sarkar et al., 2018, p. 40). Not all
available data is used for training an algorithm, but the data set is divided into a training and
test set, typically in an 80 % to 20 % ratio (Géron, 2018, pp. 30–31).
Table 3.1: Definitions and terminology (Mohri et al., 2018, pp. 4–5)
Examples: Instances of data
Features: Set of attributes
Labels: Values or categories assigned to examples
Hyperparameters: Free parameters as inputs to the learning algorithm
Training sample: Examples used to train a learning algorithm
Validation sample: Examples used to tune the parameters of a learning algorithm when working with labeled data
Test sample: Examples used to evaluate the performance of a learning algorithm
Loss function: A function that measures the difference, or loss, between a predicted label and a true label
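As a hedged illustration of these terms, the sketch below uses scikit-learn (an assumed tool choice, not prescribed by this thesis) to select a subset of features and to split a synthetic labeled dataset into training and test samples in the 80/20 ratio mentioned above.

```python
# Sketch: feature selection and an 80/20 train/test split (scikit-learn, synthetic data).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a tabular production dataset: 500 examples, 10 features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

# Feature selection: keep the 4 features that are most associated with the label.
X_selected = SelectKBest(score_func=f_classif, k=4).fit_transform(X, y)

# Hold out 20 % of the examples as a test sample for the later evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)  # (400, 4) (100, 4)
```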
Tasks are the problems that can be solved with ML (Flach, 2012, p. 13). At Microsoft, an ML
task is defined as the type of prediction being made, based on the available data and the
question that is being asked (Quintanilla et al., 2019). Three of the most common and most
important tasks are classification, regression, and clustering, which all serve different
purposes. While classification predicts a nominal target value, that is to say classes,
regression predicts a continuous value (Harrington, 2012, p. 9). Categorizing an item into
one of two classes is called binary classification, in the case of more than two different
classes the task is referred to as multiclass classification. Clustering uses similarity and
distance to group items. Figure 3.4 illustrates how each of the mentioned tasks works. For
classification and regression, past data must be labeled, that is, the classes or
numerical values of previous elements must be known. In contrast, clustering does not need
any additional information in form of a target value or given label from past data (Harrington,
2012, p. 10).
Figure 3.4: Binary and multiclass classification, regression and clustering (Singh, 2021, pp. 8–
10)
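A compact, purely illustrative sketch of the three tasks (again assuming scikit-learn; the values are made up) may help to contrast them:

```python
# Sketch: classification, regression and clustering on tiny illustrative data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]])

# Classification: predict a nominal class; labels for past data are required.
clf = DecisionTreeClassifier().fit(X, ["ok", "ok", "ok", "not ok", "not ok", "not ok"])
print(clf.predict([[2.5]]))        # -> ['ok']

# Regression: predict a continuous value; numerical targets are required.
reg = LinearRegression().fit(X, [1.1, 2.0, 3.2, 7.9, 9.1, 10.2])
print(reg.predict([[5.0]]))        # -> a value close to 5

# Clustering: group similar items; no labels are needed.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X))  # -> two groups
```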
ML types help to distinguish tasks for a better understanding of the existing variety in ML
(Figure 3.5). The presented selection of tasks, also including classification, regression, and
clustering, does not provide an exhaustive list of all available tasks in academia, but aims to
give an overview of the most relevant ones.
Figure 3.5: Machine learning types according to en.proft.me (2015)
Supervised learning requires the availability of a target variable or label, whereas
unsupervised learning does not (Mohri et al., 2018, pp. 6–7). A supervised learner is
provided with extra information in form of labels by the environment. In case of classification,
an instance is affiliated to a class through a label, for regression a continuous number for
each instance is given. Unsupervised learning is characterized by a learning algorithm which
processes input data without external supervision (Shalev-Shwartz & Ben-David, 2019,
pp. 22–23). In order to cluster similar items of data, no label or target variable is required.
Semi-supervised learning, as suggested by the name, builds on a partially labeled data set
(Mohri et al., 2018, pp. 6–7). A small, labelled training set is used to build an initial model,
which is then refined using the unlabeled data. This ML type can be useful when obtaining labelled
data is associated with high cost (Flach, 2012, pp. 14–20).
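The following sketch illustrates the idea with scikit-learn's label propagation (one possible semi-supervised method, chosen here only for illustration): two labeled examples are used to infer labels for the remaining, unlabeled ones.

```python
# Sketch: semi-supervised learning with a partially labeled data set (scikit-learn).
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.2], [1.1], [7.9], [8.2], [8.0]])
# Only two examples carry a label; -1 marks the unlabeled instances.
y = np.array([0, -1, -1, 1, -1, -1])

model = LabelPropagation().fit(X, y)
print(model.transduction_)          # labels propagated to the unlabeled examples
print(model.predict([[1.5], [7.5]]))
```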
Reinforcement learning may or may not need a target variable but works in a different manner
than the other three ML types. As visualized in Figure 3.6, an agent with a set of strategies or
policies takes an action upon observing the state of the environment, gets a reward or penalty and
updates its policies (Sarkar et al., 2018, pp. 42–43). In order to reach the goal, the agent
generates a policy through the assessment of past sequences of actions (Alpaydin, 2014,
p. 13).
Figure 3.6: Visualization of reinforcement learning (Singh, 2021, pp. 12–13)
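The loop below is a toy sketch of this interaction cycle, not a real reinforcement learning algorithm; the environment, reward and update rule are invented purely to illustrate the state-action-reward terminology.

```python
# Toy sketch of the agent-environment loop (illustrative only, no real RL algorithm).
import random

def environment_step(state, action):
    """Hypothetical environment: returns the next state and a reward."""
    next_state = max(-5, min(5, state + action))
    reward = 1.0 if next_state == 0 else -0.1   # goal: reach state 0
    return next_state, reward

policy = {s: random.choice([-1, 1]) for s in range(-5, 6)}  # state -> action

state = 3
for _ in range(20):
    action = policy[state]                       # the agent acts based on its policy
    next_state, reward = environment_step(state, action)
    if reward < 0:                               # naive policy update from the reward
        policy[state] = -action
    state = next_state
```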
An enormous number of algorithms for different ML tasks can be found in the literature.
Common algorithms for the supervised task of classification are k-nearest neighbors, support
vector machines and decision trees, which all work in a different way (Harrington, 2012,
p. 10). The k-nearest neighbors algorithm classifies data based on distance measurement to
existing data points and, in doing so, looks at the top k most similar pieces of data
(Harrington, 2012, p. 19). Support vector machines aim to separate data with the maximum
margin, in other words, the best separating line is to be found (Harrington, 2012, p. 102).
Decision trees split data sets one feature at a time and have the advantage of being
understandable by humans even without specific ML knowledge (Harrington, 2012, p. 38). In
academia, there are many more algorithms that are either less common than the presented
ones or are applied in other use cases, such as the Naïve Bayes algorithm, which uses probability
theory to classify based on non-numeric, nominal values (Harrington, 2012, p. 62).
Instead of one single model, multiple models in form of an ensemble can be employed.
Combining multiple learners is a strategy to confront the No Free Lunch Theorem which
states that there is no single learning algorithm which in any domain always induces the most
accurate learner (Alpaydin, 2014, p. 487; Flach, 2012, p. 330).
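As an illustrative sketch (scikit-learn on synthetic data, both assumptions not taken from the cited sources), the snippet below trains the three classifiers mentioned above individually and then combines them into a simple majority-vote ensemble:

```python
# Sketch: three common classifiers and a simple voting ensemble on the same data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

single_models = {
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "support vector machine": SVC(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}
for name, model in single_models.items():
    print(name, model.fit(X_train, y_train).score(X_test, y_test))

# Ensemble: combine the three learners by majority vote.
ensemble = VotingClassifier(estimators=[
    ("knn", KNeighborsClassifier()), ("svm", SVC()), ("tree", DecisionTreeClassifier())])
print("ensemble", ensemble.fit(X_train, y_train).score(X_test, y_test))
```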
Regarding the learning protocol, there is online and batch learning. Online learning means
that the learner has to respond online, throughout the learning process, so there is no
separation between the training phase and prediction phase (Shalev-Shwartz & Ben-David,
2019, p. 24). Online learning is also referred to as incremental learning as the model
continuously learns with new data (Flach, 2012, p. 361). In batch learning scenarios, the
model is trained on large amounts of training data at once before making predictions (Sarkar
et al., 2018, pp. 43–44).
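A brief sketch of the difference, again assuming scikit-learn and a simulated data stream:

```python
# Sketch: batch learning vs. online (incremental) learning on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Batch learning: the model is trained once on all available data.
batch_model = LogisticRegression(max_iter=1000).fit(X, y)

# Online / incremental learning: the model is updated chunk by chunk as data arrives.
online_model = SGDClassifier(random_state=0)
for start in range(0, len(X), 100):               # simulate a stream of data chunks
    X_chunk, y_chunk = X[start:start + 100], y[start:start + 100]
    online_model.partial_fit(X_chunk, y_chunk, classes=[0, 1])

print(batch_model.score(X, y), online_model.score(X, y))
```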
3.2.4 Evaluation and Deployment
The performance is usually a quantitative measure or metric which is used to see how well
the algorithm or model is performing the task with experience (Sarkar et al., 2018, p. 12).
Binary classification has already been identified as the adequate task for predictive quality.
Thus, the focus lies on performance measures for this specific application area while metrics
for further tasks such as regression are not covered.
Table 3.2 depicts the so-called confusion matrix which juxtaposes the actual classes and the
ones predicted by the ML model. Two classes are differentiated: positive and negative. True
positives, short TP, are all elements that are actually positive and are identified correctly as
positive by the model. Following the same logic, true negatives (TN) belong to the actual
negative class and are rightly identified as negative. False negatives (FN) and false positives
(FP) are wrongly classified as negative or positive, respectively.
Table 3.2: Confusion matrix (Harrington, 2012, p. 144)
                      Actual positive        Actual negative
Predicted positive    True Positive (TP)     False Positive (FP)
Predicted negative    False Negative (FN)    True Negative (TN)
The performance of a classifier can typically be measured through a set of different metrics
shown in Table 3.3, including accuracy and error rate, with the sum of accuracy and error rate
equaling exactly 1. In addition to accuracy and error rate, typical performance measures for
classification include precision, recall (also called sensitivity), specificity, and the F1-score. All
of them range between 0 and 1, with 1 being the perfect score. Using a single indicator as the
objective function for the optimization of an algorithm’s parameters may lead to undesired
results. Therefore, the F1-score combines precision and recall into one metric.
Table 3.3: Classification metrics, each given as a verbal and a mathematical formula (Flach, 2012, pp. 53–61; Harrington, 2012, p. 24; Sarkar et al., 2018, p. 12)
Accuracy = number of correct predictions / total number of predictions = (TP + TN) / (TP + TN + FP + FN)
Error rate = number of wrong predictions / total number of predictions = (FP + FN) / (TP + TN + FP + FN)
Precision = true positives / predicted positive results = TP / (TP + FP)
Recall (Sensitivity) = true positives / actual positive results = TP / (TP + FN)
Specificity = true negatives / actual negative results = TN / (TN + FP)
F1-Score = 2 * precision * recall / (precision + recall) = TP / (TP + (FP + FN) / 2)
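The metrics in Table 3.3 can be reproduced with a few lines of Python; the sketch below uses scikit-learn's metric functions (an assumed tool choice) on a small, made-up set of predictions.

```python
# Sketch: computing the classification metrics from Table 3.3 for a toy prediction.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = positive ("not ok"), 0 = negative ("ok")
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)

print("accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / all predictions
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1-score :", f1_score(y_true, y_pred))          # combines precision and recall
# Specificity is not a built-in scorer but follows directly from the confusion matrix:
print("specificity:", tn / (tn + fp))                  # TN / (TN + FP)
```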
Instead of the standard metrics, it is also possible to use a modified personalized cost
function to adjust the weights to, e.g., penalize wrong classification (Sarkar et al., 2018,
p. 12). Nonetheless, the success of an ML algorithm is subject to the available data (Mohri
et al., 2018, p. 1). The fact that a model’s performance is always only as good as the data is
expressed by Sarkar et al. (2018, p. 44) through the following phrase: “Garbage in, garbage
out.”
The performance is not only limited by the data quality but also by bias, which describes prior
knowledge or prior assumptions built into the model that influence its performance on the task
(Shalev-Shwartz & Ben-David, 2019, p. 60). Performance measures make it possible to identify
deviations from the optimal model complexity (Singh, 2021, p. 15). The trade-off between the
sample size and complexity plays a critical role in generalization. When the sample size is
relatively small, choosing an overly complex algorithm may lead to poor generalization, which
is known as overfitting. On the other hand, an overly simple algorithm may not be able to
achieve a sufficient score, which is known as underfitting (Mohri et al., 2018, p. 8).
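The effect can be illustrated with a small sketch: an overly shallow decision tree tends to underfit, while a fully grown tree on a small sample tends to overfit, which becomes visible in the gap between training and validation accuracy. The dataset and parameter choices below are purely illustrative.

```python
# Sketch of under- and overfitting (illustrative only): a shallow decision
# tree underfits, while a fully grown tree overfits the small training sample.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for max_depth in (1, 4, None):               # None lets the tree grow fully
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    clf.fit(X_train, y_train)
    print(max_depth,
          round(clf.score(X_train, y_train), 3),   # training accuracy
          round(clf.score(X_val, y_val), 3))       # validation accuracy
```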
If the evaluation phase of the life cycle is passed successfully, the deployment is realized.
Generally, deployment is defined as “the action of bringing resources into effective action”
(Oxford University Press, 2020). When talking about deployment in the context of ML,
different definitions exist. Singh (2021, p. 57) defines the task of deployment
as integrating the ML model into an existing business application, which coincides with the
deployment’s definition in CRISP-DM. In an alternative definition by Galli (2020), the
deployment of ML models refers to making the models available in a production environment
in order to provide predictions to other software systems and clients. In this thesis, the goal
of deployment is defined as follows: making a resulting model available in a specific
environment in order to make the results usable where they are needed.
When it comes to deployment, not only the ML models but the whole ML pipeline is to be
deployed. An ML pipeline encompasses all the steps required to get a prediction from data
with the ML algorithm only being one component of said pipeline (Figure 3.7).
Figure 3.7: Typical Steps in an ML Pipeline based on Galli (2020)
The goal of building an ML model is to solve a problem, and an ML model can only do so
when it is in production and actively in use by consumers (Singh, 2021, p. 58). To maximize
the value of any ML model, the capability to reliably extract predictions and share them with
other systems must be enabled. As such, model deployment is as important as model building.
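A minimal sketch of such a pipeline, bundling feature engineering, feature selection and the ML algorithm into one deployable object, could look as follows; scikit-learn is assumed, and the dataset as well as the chosen components are purely illustrative.

```python
# Minimal ML pipeline sketch mirroring Figure 3.7: feature engineering,
# feature selection and the ML algorithm bundled into one deployable object.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

pipeline = Pipeline(steps=[
    ("feature_engineering", StandardScaler()),          # simple stand-in for feature engineering
    ("feature_selection", SelectKBest(f_classif, k=10)),
    ("ml_algorithm", RandomForestClassifier(random_state=0)),
])
pipeline.fit(X, y)                  # train the whole pipeline, not just the model
predictions = pipeline.predict(X)   # data in, predictions out
```

Deploying the fitted pipeline object rather than the bare model ensures that the same preprocessing steps are applied at prediction time.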
3.3 Software Engineering
After creating an ML model for a use case in quality management, the deployment of the
created models requires an understanding of principles from the field of software
engineering. According to IEEE Computer Society, software engineering is defined as the
“application of a systematic, disciplined, quantifiable approach to the development, operation,
and maintenance of software”. First, the focus is set on traditional software and then on the
characteristics of software in combination with ML.
3.3.1 Traditional Software
The development process of traditional software is described first, followed by the relevant
aspects of collaboration and automation during all three phases: development, operation and
maintenance.
Development
Initial approaches to software development date back to 1956, when the nine-phase stage-
wise model was introduced by Benington. Based on this, Royce presented his proposal for
managing the development of large software systems in 1970 (see Figure 3.8). Although the
term itself was not used by the author, the approach gained popularity under the name of
waterfall model and is considered the traditional approach to software development.
Figure 3.8: Implementation steps to develop a large computer program (Royce, 1970)
As evolution continued, improved approaches such as the Spiral Model by Boehm (1988) or
the Vee-Model (Forsberg & Mooz, 1998) were introduced. A shared weakness of all previously
mentioned approaches is that the need to continuously repeat stages was not anticipated
proactively but handled reactively. For this reason, methodologies with an iterative character
in the form of the Software Development Life Cycle (SDLC), shown in Figure 3.9, emerged
(Everett & McLeod, 2007, p. 57).
Figure 3.9: Software development life cycle according to bigwater.consulting
Collaboration and Automation
Originally introduced as an agile framework for software development, the scrum technique
uses incremental, iterative work sequences to manage and enhance the speed of
development projects. When working in a team of developers, work is divided into actions
that are completed within sprints. Daily scrums serve to track progress and re-plan (Figure
3.10). The framework not only finds application in fields related to software but has also
manifested itself in other areas such as product development. The topic has lost none of its
relevance, so guides for the scrum methodology are continuously developed and improved
(Schwaber & Sutherland, 2020).
[Figure 3.8 steps: System Requirements, Software Requirements, Analysis, Program Design, Coding, Testing, Operations. Figure 3.9 phases: 1 Planning, 2 Analysis, 3 Design, 4 Implementation, 5 Testing & Integration, 6 Maintenance]
Figure 3.10: Scrum framework according to Scrum.org
In 2009, during a conference about deploying software, the term DevOps was coined as a
combination of the words software development and IT operations. The main idea is to focus
on the collaboration between developers and operators while not neglecting the alignment
with the business side. DevOps provides a collection of field-tested and working approaches
to address this problem (Halstenberg et al., 2020). Figure 3.11 shows a graphical visualization
of the DevOps process which has been widely adopted in the community. The DevOps
approach features building blocks which correspond to the phases in previous software
development models, with the main difference to traditional approaches being the
collaboration between development and operations.
Figure 3.11: DevOps approach according to Harlann (2017)
Practices such as continuous integration (CI), continuous delivery (CD) and continuous
deployment go even one step further by automating steps. They aim to reduce the required
effort and avoid possible errors which can occur during a manual execution resulting in a
more efficient and more secure development process. Figure 3.12 reveals the difference
between the mentioned practices.
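As an illustration of the automation idea, continuous integration typically runs an automated test suite on every code change. A minimal sketch of such a test, written here with pytest, might look as follows; the predict_quality function is a hypothetical stand-in for application logic and not part of any cited approach.

```python
# Minimal sketch of an automated test that a CI pipeline could run on every
# commit. predict_quality is a hypothetical stand-in for application logic.
import pytest

def predict_quality(measurements: list) -> str:
    """Hypothetical stand-in for the deployed prediction logic."""
    return "ok" if sum(measurements) < 10.0 else "failure"

def test_prediction_returns_valid_label():
    assert predict_quality([1.0, 2.0, 3.0]) in {"ok", "failure"}

def test_prediction_rejects_missing_input():
    with pytest.raises(TypeError):
        predict_quality(None)
```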
[Figure 3.10 elements: Product Backlog, Sprint Planning, Sprint Backlog, Scrum Team, Daily Scrum, Sprint Review, Sprint Retrospective, Product Increment. Figure 3.11 loop: code, build, test, deploy, operate, monitor]
Figure 3.12: Continuous integration, delivery and deployment according to Pennington (2019)
3.3.2 ML Software
Deploying an ML model can be understood as developing a program around the model
which is to be made accessible to the end user. As an important distinction, deployment
as part of the ML workflow must not be mistaken for the deploy step in software development
approaches such as DevOps: the deployment of ML models includes the whole software
development process. Nonetheless, there are three fundamental differences between
developing an ML application and conventional software. First, the handling of data is more
complex. Second, skills in both software engineering and ML are required. And third, software
components are interwoven with no clear boundaries. It is therefore all the more important
to understand the challenges of ML software and address problems in a timely manner
(Amershi et al., 2019).
Since ML places special demands on software development, but fundamental similarities
exist, approaches from software engineering can be tailored to ML in a modified form. In this
way, the discipline of MLOps arose. Just as in DevOps, robust automation, trust and
collaboration between teams, as well as delivering high quality along the whole end-to-end
service life cycle, play key roles. However, MLOps represents a new and unique discipline, as
deploying traditional software is not as complex as deploying an ML application into
production (Treveil & Dataiku Team, 2020, pp. 6–7). Compared to the complexity of the ML
environment, with its dynamically changing data, traditional software is relatively static.
Graphically, the DevOps circle can be extended to include ML as an upstream task, as
illustrated in Figure 3.13.
Figure 3.13: MLOps (Neal Analytics, 2020)
[Figure 3.12/3.13 content: the DevOps loop (code, build, test, deploy, operate, monitor) with continuous integration, continuous delivery and continuous deployment; the MLOps loop in Figure 3.13 adds the steps plan, create, data, model, verify, package, release, configure, monitor]
4 State of the Art
In this chapter, the state of the art is analyzed by examining existing approaches
to ML deployment for predictive quality in production. After the definition of evaluation
criteria, a literature review is executed including the evaluation of selected publications. Out
of the selected approaches, the most relevant ones are then presented in more detail.
4.1 Definition of Evaluation Criteria
Selecting suitable criteria for the evaluation of available concepts represents an important
step in this work. The criteria must be defined in a coherent way so that, if all criteria are met,
the overall objective is achieved. According to Keeney and Gregory (2005) appropriate
attributes are unambiguous, comprehensive, direct, operational, and understandable. In
other words, good evaluation criteria need to be accurate, not redundant, ends-oriented,
practical, and easy to understand. Furthermore, independence between attributes is to be
strived for (Keeney, 1992).
In order to structure the defined criteria, they are assigned to the object domain, the solution
hypothesis, or the target domain. With criteria belonging to the object domain, the scope of
the analysis is evaluated. The purpose and goal of an approach is captured by criteria
associated to the target domain. With the help of criteria regarding the solution hypothesis,
the specific solution path chosen by the respective authors to achieve the goal is assessed.
Table 4.1 gives an overview of the defined evaluation criteria, which can be applied to
sources across different fields of investigation.
Table 4.1: Evaluation criteria
Category Evaluation Criteria
1. Object Domain 1.1 Deployment of ML
1.2 ML for Predictive Quality
2. Solution Hypothesis 2.1 Strategic Planning
2.2 Operational Realization
3. Target Domain 3.1 Guideline Structure
3.2 Transferability
Deployment of ML
As the first of two criteria of the object domain, the level of detail with which approaches deal
with the deployment of ML models or any AI application is evaluated. Based on
evidence from studies, chapter 2 showed that many ML projects fail during the deployment
phase and all activities associated with deploying ML models need to be analyzed critically.
Thus, it is necessary to evaluate existing approaches with respect to their touch points with
deployment. These might range from only mentioning the deployment up to focusing on
deploying ML models and ML software.
ML for Predictive Quality
A lack of coordination between roles is one of the common reasons for the unsuccessful
realization of projects (see chapter 2). In other words, the inclusion of domain knowledge is
crucial for the success of ML in general but also for the deployment. In chapter 3.1.2,
exemplary use cases of ML for quality prediction in production were described highlighting
the relevance and potential of the application of ML in this context. Predictive quality use
cases come with their individual requirements which can deviate greatly from applications in
other fields and environments. Therefore, existing approaches are examined to see to what
extent they cover the application of ML models for predictive quality. Available sources may
not treat the use of ML in manufacturing environments at all. Others may describe its use for
generic purposes in production with predictive quality being the most specific application.
Strategic Planning
This and the following criterion relate to the question of how the authors address the topic.
As seen in the detailed description of the problem in practice, the success of ML projects
depends both on strategic planning and operational realization.
High-level decisions comprise planning activities that shape the future direction by
determining the desired characteristics of the system; the selection is particularly important
due to the long-term effects of the decision. This criterion analyzes the level of
detail with which existing approaches cover the deployment from a strategic point of view.
Approaches which fulfill this criterion to an advanced degree present detailed concepts and
frameworks in form of theoretical analyses or generic procedures.
Operational Realization
In addition to the strategic perspective, the challenges for deployment also need to be
addressed on a more practical level. Among many others, relevant factors for the quality and
efficiency of implementation activities are the use of best practices and selection of tools. By
means of this criterion, approaches are assessed regarding their operational depth. This
operational point of view focuses on practical questions about the best way of successfully
realizing the implementation. Approaches may present a use case from a real-life application
including the provision of tools, implementation steps and results.
Guideline Structure
As an overall objective, this thesis aims to answer the research question formulated in
chapter 1.2 which focuses on how to deploy ML models for predictive quality in production.
Based on the current status of ML projects in practice, companies need instructions in form
of a guideline which they can transfer to their individual needs for successfully realizing the
deployment.
With the help of this penultimate criterion, the format of the respective approaches is
assessed to determine to what extent the respective approach serves as a guideline.
Approaches that aim to be used as a guideline may provide step-by-step instructions or a
clear structure which allows the reader to gain insights and knowledge on how to address the
topic.
Transferability
Not only is a guideline format required; the approach must also be transferable to different
use cases which fall into the category of ML in production but are based on a different set of
requirements. Thus, this criterion evaluates how well the respective
approaches are transferable to further use cases with company-specific needs and
restrictions in order to find the best fitting solution for any specific situation. It is analyzed how
easily existing approaches can be implemented in different environments or based on a
different set of requirements.
4.2 Literature Review
Based on the procedure for literature analysis defined by Cumbie et al. (2005) the following
methodological steps are applied in this work:
1. Accumulation and selection of publications
2. Categorization of publications
3. Evaluation of publications
At first, a pool of publications is accumulated out of which the most relevant ones are
selected. As a second step, selected publications are classified by their type and method. In
the third and final phase, an analysis of the results in form of an evaluation is conducted.
4.2.1 Step 1: Accumulation and Selection of Publications
Accumulation of Publications
As the deployment of ML models for predictive quality in production is located at the
intersection of two fields of investigation, production engineering on the one hand and
software engineering on the other, the literature review needs to reflect this. Due to the quite
different character of
both fields regarding the availability of literature, an individual search process for each field is
required. Both searches are not sharply separable, so that results can appear multiple times
within one search or across the two searches.
Table 4.2: Search strings
Search I (focus on industrial production engineering): machine learning AND deploy* AND (predictive quality OR production quality OR product quality OR manufacturing quality OR quality prediction)
Search II (focus on software engineering): (machine learning OR artificial intelligence) AND deploy* AND (producti* OR model serv* OR software engineering)
Note: The asterisk (*) indicates a set of key words beginning with the respective prefix.
For publications treating the application of ML models in manufacturing settings, the principal
source is the database ScienceDirect containing journal articles. The search strings of Search I
in Table 4.2 are used to search within the title, abstract and key words. Combining
all searches, 133 search results were found in ScienceDirect with the same articles
appearing multiple times across different search terms. Furthermore, the same search
strings were used for a search in SpringerLink. The search was limited to sources with
“machine learning” in the title and having to contain the term “deployment” and each of the
listed quality-related key words. The search resulted in 52 items. Given the search
parameters, all the aforementioned publications focus on the application of ML models for
quality-related tasks in manufacturing. From a production engineering perspective, they
provide relevant use cases and help to identify the needs that are relevant for the
deployment but do not cover the deployment from a software engineering point of view.
Illuminating the deployment of ML models as a software engineering topic requires a different
set of search strings, which can be found under Search II in Table 4.2. Initially, the search
is conducted in a similar manner to the previously described procedure by performing an
advanced search in ScienceDirect and SpringerLink. In ScienceDirect, the search resulted in
77 findings. In SpringerLink, the search was adjusted in such a way that the key words
in the last group were not searched independently but in combination (AND instead of OR).
This led to the identification of 119 items. As an additional source, the ACM digital library was
consulted offering a comprehensive collection of full-text articles covering the fields of
computing and information technology. It allows finding conference contributions that were
published as proceedings. By searching for machine learning or artificial intelligence in the
title, deploy in the abstract and the variation of the last group of the key words in the full text,
88 search results were found.
In order to gather all relevant sources, searches in dynamically changing areas of
investigation cannot be limited to academic publications but need to include so-called gray
literature. In a field like software engineering, the academic literature only gives an
incomplete view on the topic. Through gray literature publications, practitioners can provide
contextual information from their experience and, thus, verify scientific outcomes from a
practical point of view (Garousi et al., 2019). As activities for deploying ML models are
closely related to software engineering, the search in this thesis considers gray literature. ML
deployment is characterized by being a dynamic subject with a lot of approaches which do
not follow scientific rules. In many cases, best practices are created by professionals to
streamline the processes in the real world and only then transferred into academic
environments. Thus, relevant sources also include conference contributions which are not
published in a proceedings book, white papers from associations and companies, and
internet articles such as blog entries by highly acknowledged practitioners to share
experiences from practice. To identify the application-oriented publications, Google Scholar
and the regular Google Search Engine were used. The searches, based on the same search
strings as described before, resulted in a very high number of results, out of which the first
pages of results were considered.
Selection of Publications
Based on the search results obtained, the most relevant items for this thesis are selected.
Depending on the type of publication, selection criteria differ. For all resulting journal articles,
either from production or computer science background, the abstract is read and matched
with the scope of this work. Where applicable, the selection considered manufacturing-
related publications while discarding search results from other fields such as medicine.
Whether a book is selected depends on the title, table of contents and introduction. Conference
papers typically do not undergo an equally rigorous review process as journal articles and therefore
need to be examined with more detail regarding their quality and type of conference. Gray
literature publications such as white papers and internet articles offer the least level of
credibility and require an even more profound examination of worthiness. This examination
reviews the background and expertise of each author. Additionally, the credibility and
independence of the publishing website is assessed. Moderated online publishing platforms
are more likely to offer objectivity than company-sponsored pages that might aim to advertise
a certain product. The quality of content in form of the provided level of detail and extent of
the text also impacts the selection decision. Furthermore, gray literature does not only
demand a quality check as a prerequisite for the use of this kind of source but also requires a
separate archiving process as its availability, in contrast to, e.g., journal articles, is not
guaranteed in the future.
Based on this initial set of publications, further relevant sources are to be investigated.
Snowballing is a technique that can be applied to identify related publications (Wohlin, 2014).
Backward snowballing refers to identifying new relevant articles in the bibliography of a
paper, whereas forward snowballing means to examine papers citing the respective paper in
the starting set. This forward and backward search can also be applied to find similar
publications by the same author.
Additionally, it is possible to find connected papers and visualize the connections through
graphs as illustrated exemplarily through a screenshot in Figure 4.1 taken from
connectedpapers.com. The graph is not a citation tree but arranges papers according to their
similarity. Similar papers have strong connecting lines and cluster together. Each paper’s
number of citations is represented by the node size, the node color indicates the publishing
year. For publications that are identified through techniques such as snowballing or
connection graphs, the same selection criteria as for the initial set apply.
Figure 4.1: Connected papers to Sculley et al. (2015) from connectedpapers.com
For the first, production engineering-related search, a total of 19 publications were selected.
As stated before, the second search focusing on the field of software engineering resulted in
27 relevant approaches that were selected for further analysis. Through the inclusion of gray
literature, the number of selected publications is slightly higher than for the first executed
search. In total, 46 publications are identified as relevant for this thesis and will be analyzed
in the following in more detail.
4.2.2 Step 2: Categorization of Publications
To classify the selected items, metadata such as publication year and type of origin are
examined and visually illustrated. Figure 4.2 arranges the final search results by the
respective year in which they were published. ML in manufacturing and deploying ML models
has gained popularity over the last years so that the most relevant sources for this work
originate from 2020. The low number of results from 2021 is due to the time of conducting
this research in spring 2021. Overall, the graphic highlights the high topicality and emerging
relevance of the subject and serves as a confirmation of the need for research. Over the
course of the upcoming years, the number of publications about the topic is expected to
increase even more.
Figure 4.2: Year distribution of selected publications (Total of 46)
In order to gain more insight into the identified results, all selected publications are classified
by the type of publication in Figure 4.3. Roughly half of the final results belong to the
categories of books or book chapters and journal articles. Conference papers
make up nearly a quarter of the selected publications. Gray literature in the form of white papers
and internet articles accumulates to a little more than a quarter. It can be seen that academic
literature is not sufficient, but a mix of different publication types is necessary to fully grasp
the topic.
Figure 4.3: Type distribution of selected publications (Total of 46)
[Figure 4.2 data: 2015: 1, 2016: 1, 2017: 3, 2018: 5, 2019: 8, 2020: 21, 2021: 7 publications. Figure 4.3 data: Book/Chapter: 5 (10.9 %), Journal Article: 19 (41.3 %), Conference Paper: 10 (21.7 %), White Paper: 3 (6.5 %), Internet Article: 9 (19.6 %)]
4.2.3 Step 3: Evaluation of Publications
As a third and final step, the content of each approach within the selected results is
evaluated. For a consistent evaluation of the literature, the existing approaches are assessed
with the help of the criteria defined previously in chapter 4.1. As there are no binary criteria
like yes-no questions, it is analyzed to what extent each criterion is fulfilled by the existing
approaches. The degree of fulfillment of each criterion ranges from not at all fulfilled, to
sparsely, partly, mainly, and completely fulfilled. Table 4.3 has the respective authors listed
in the rows and the evaluation criteria in the columns with the fulfillment degree for the
approaches being visualized by Harvey balls. The approaches are grouped according to their
focus, either production or software engineering. They are arranged regarding the publishing
year and sorted by the author’s name within each year.
Table 4.3: Evaluation of existing approaches
Explanation:
● Completely fulfilled
◕ Mainly fulfilled
◑ Partly fulfilled
◔ Sparsely fulfilled
○ Not at all fulfilled
Criteria: 1.1 Deployment of ML | 1.2 ML for Predictive Quality | 2.1 Strategic Planning | 2.2 Operational Realization | 3.1 Guideline Structure | 3.2 Transferability
Search I
Brüning et al. (2017) ◔ ● ◔ ◕ ○ ◔
Vafeiadis et al. (2017) ◔ ● ◕ ○ ◔ ◑
Mehta et al. (2018) ◑ ◕ ◕ ◑ ○ ◑
Nalbach et al. (2018) ◔ ● ◑ ◕ ○ ◑
Ariharan et al. (2019) ◕ ◕ ◑ ◔ ○ ◔
Escobar et al. (2020) ◔ ● ◕ ◕ ○ ◔
Kimera and Nangolo (2020) ◑ ◑ ◑ ● ○ ○
Krauß et al. (2020) ◔ ● ◑ ◑ ○ ◔
Lehmann et al. (2020) ◑ ◕ ● ● ○ ◔
Rychener et al. (2020) ◕ ◑ ● ◔ ◔ ◕
Schorr et al. (2020) ○ ● ◑ ◑ ○ ◔
J. Schmitt et al. (2020) ◑ ● ◕ ◑ ○ ○
Svetashova et al. (2020) ○ ● ◔ ● ○ ○
Yong and Brintrup (2020) ◑ ◑ ◑ ◑ ○ ◑
Goldman et al. (2021) ◔ ● ○ ◕ ○ ○
Lichtenwalter et al. (2021) ◑ ◕ ◔ ◑ ◔ ◕
Pilarski et al. (2021) ◕ ◔ ◑ ◕ ○ ◔
Turetskyy et al. (2021) ◑ ◕ ◕ ◑ ○ ◔
Zeiser et al. (2021) ◔ ● ◔ ◔ ○ ○
Criteria: 1.1 | 1.2 | 2.1 | 2.2 | 3.1 | 3.2
Search II
Sculley et al. (2015) ◑ ○ ◕ ◔ ◑ ◕
Zinkevich (2016) ◕ ○ ◕ ◑ ◕ ●
Breck et al. (2017) ◑ ○ ◕ ◔ ◑ ◕
Ackermann et al. (2018) ● ○ ◔ ◑ ◔ ◔
Crankshaw and Gonzalez (2018) ◕ ○ ◑ ○ ◔ ◔
Muthusamy et al. (2018) ● ○ ◑ ◔ ◔ ○
Amershi et al. (2019) ◕ ○ ◕ ○ ◔ ◑
Gisselaire et al. (2019) ● ○ ◕ ◔ ◔ ◕
Kervizic (2019) ● ○ ◑ ◔ ◕ ◑
Lwakatare et al. (2019) ● ○ ◕ ◑ ○ ◔
Samiullah (2019, 2020) ● ○ ◕ ○ ● ◕
Sato et al. (2019) ● ○ ◕ ◕ ● ◑
Washizaki et al. (2019) ◕ ○ ◑ ○ ○ ○
Agrawal and Mittal (2020) ◑ ○ ◔ ○ ◕ ○
Akyildiz (2020a, 2020b) ● ○ ◑ ○ ◑ ●
Bhatt et al. (2020) ● ○ ◕ ○ ◔ ○
Debauche et al. (2020) ● ○ ◑ ◑ ◔ ◑
Figalist et al. (2020) ◕ ○ ● ◑ ◑ ◑
Liu et al. (2020) ● ○ ◑ ◔ ◔ ◑
Odegua (2020) ● ○ ◔ ◔ ◕ ◕
Pääkkönen and Pakkala (2020) ● ○ ● ◔ ○ ◑
Patruno (2020) ● ○ ◑ ● ● ◑
Pinhasi (2020) ● ○ ◑ ◔ ◑ ●
John et al. (2021) ● ○ ● ○ ◔ ◑
Singh (2021) ● ○ ◑ ● ◕ ◕
Results of search I contain scientific publications from an industrial or production engineering
background. When looking at the achieved fulfillment of the first two criteria, which belong to
the object domain, the origin of the articles becomes apparent. The approaches cover
possible applications of ML in manufacturing, not always the use case of predictive quality
but similar use cases and quality-related issues. Thus, the respective criterion is fulfilled in
many cases. As the approaches focus on the development of these ML models, the
deployment itself is not covered sufficiently. Some approaches do not provide any useful
information about the deployment, while others give some insights into the deployment process.
A common solution path for authors of items in this category is to provide either a strategic or
an operational perspective on the subject. Hence, not many approaches perform well on both
criteria of the solution hypothesis. Finally, the mean performance on the two criteria of the
target domain differs greatly. On the one hand, almost no approach serves as a guideline;
in many cases, a specific application is presented. Even though the authors do not provide
instructions, the use cases are often general enough to be transferable to other use cases.
In search II, approaches about the deployment of ML models from a software-related
background are grouped together; their focus does not lie on possible quality-related
applications in manufacturing but on the deployment process. Therefore, the performance of
the approaches in the first two criteria is evident: most approaches focus on, or at least treat,
the deployment as a key factor, but the connection to a manufacturing environment is not
made. Similar to the previous category of publications, many approaches do not cover
strategic planning and operational realization at the same time. The items in the table that
were identified through the gray literature review typically fulfill the guideline criterion best.
As many of them are aimed at being used by other practitioners, easy transferability to other
use cases is given in many cases.
Overall, both categories introduced in the search are characterized by their own strengths and
weaknesses. Approaches from production engineering describe the requirements and
specialties of applications in industry, but do not provide instructions, which makes it
difficult to adapt the presented concepts to individual new use cases. Publications with a
software engineering focus more commonly provide instructions but do not consider the
specific circumstances when deploying in a production process. Regarding strategic and
operational depth, in both categories one only rarely finds an approach that covers both
levels. In conclusion, there is no approach fulfilling all criteria to a satisfactory degree.
4.3 Most Relevant Approaches
In the following, a selection of the most important approaches from both categories is
presented. These relevant approaches introduce concepts which lay the basis for the
development of the author's own methodology in the further course of this thesis.
Automated ML for Predictive Quality by Krauß et al. (2020)
ML-based quality prediction allows the reduction of production lead time and repair costs, but
heavily depends on specialized human resources. This is where AutoML comes into play, a
technique to automate all repetitive and uncreative ML tasks in order to increase the time
spent on creative tasks. The use case consists of a process chain of six different processes,
through which each product runs sequentially, passing a quality assurance gate at the end of
each process (Figure 4.4). Whether a product is in-spec or off-spec is predicted by an ML
model, which classifies each product after completing process 5 into “ok”, “failure A” or “failure B”.
Figure 4.4: Illustration of the process chain (Krauß et al., 2020)
The authors focus on setting up and testing automated techniques for ML and only hint at
possible deployment options without further explaining or classifying them. Deploying the
model as a web server in combination with an API is identified as the most common
approach. Furthermore, on-edge deployment is described briefly.
Predictive model-based Quality Inspection by J. Schmitt et al. (2020)
In the publication, a prediction model based on supervised ML algorithms, which allows
predicting the final product quality on the basis of recorded process parameters, is developed
and deployed into the IoT architecture of a manufacturing plant. This integrated solution of
predictive model-based quality inspection in industrial manufacturing is based on the fields of
ML techniques and edge cloud computing technology; edge cloud computing combines
cloud computing and computing on an edge device. In the framework (Figure 4.5), the
deployment, understood as the organizational integration into the inspection planning process,
is distinguished from the technical implementation, which gives an orientation for the individual
configuration of the system according to requirements and resource constraints. Lastly, the
technical integration into the existing infrastructure is too individualized and therefore not
covered by the authors. The process is illustrated by a real-world use case, also including a
brief description of the tools that were used.
Figure 4.5: Predictive model-based quality inspection framework (J. Schmitt et al., 2020)
Hidden technical Debt by Sculley et al. (2015)
In their widely recognized paper about hidden technical debt in machine learning, Google
researchers Sculley et al. explore several ML-specific risk factors to account for in system
design in order to avoid massive ongoing maintenance costs in real-world ML systems.
Figure 4.6 illustrates the complexity of ML graphically by recognizing that a mature system
might end up being 5 % machine learning code and 95 % glue code, which does not add any
functionality but only serves to make different parts of code compatible. Due to the enormous
complexity, small changes may cause incalculable effects. The authors refer to this as the
CACE principle: Changing Anything Changes Everything. Thus, a tiny accuracy benefit at the
cost of massive increases in system complexity is not recommended.
Figure 4.6: ML Code as small fraction of ML systems (Sculley et al., 2015)
ML Test Score by Breck et al. (2017)
In order to reduce the technical debt, Breck et al. introduce testing and monitoring for
ensuring the production-readiness of an ML system. But again, testing of ML systems proves
to be more challenging than in traditional software systems due to the strong dependency on
data and models. Figure 4.7 indicates the necessary increase in testing and monitoring complexity. The approach
provides a checklist which can be run against any ML system.
[Figure 4.5 components: physical process, data collection and processing, data storage, model training and scoring, model deployment, distinguished into technical implementation and technical integration. Figure 4.6 components surrounding the small ML code box: configuration, data collection, feature extraction, data verification, machine resource management, serving infrastructure, monitoring, process management tools, analysis tools]
Figure 4.7: Traditional system and ML-based system testing and monitoring (Breck et al., 2017)
Rules for ML by Zinkevich (2016)
The author presents best practices in ML from his experience at Google in a similar form to
guides to practical programming. In total, 43 rules arranged in four parts are presented. In
the first part, called “before ML”, there are three rules that aim to help determine whether
the time is right for building an ML system. The second part, ML Phase I, is about deploying
an initial pipeline and monitoring it on the basis of adequate objectives. This section contains
12 rules. ML Phase II, as the third of the four parts, comprises 22 rules about launching and
iterating while adding new features to the pipeline. These rules also treat the evaluation of
models and the training-serving skew. Finally, in the last part called ML Phase III, six rules
about slowed growth, optimization refinement, and complex models are discussed.
CD4ML by Sato et al. (2019)
When researching ML deployment from an application-near point of view, the publication by
Sato et al. in collaboration with Martin Fowler is one of the most cited references. Continuous
Delivery for Machine Learning (CD4ML) is a software engineering approach to develop,
deploy, and continuously improve ML applications. The end-to-end process (Figure 4.8)
consists of steps that can be automated, with three axes that are subject to change and must
be taken into account: code, model and data. With the help of an example, the steps
beginning with the model building and ending with the monitoring and observability are
illustrated. Relevant aspects such as ensuring discoverable and accessible data, setting up
reproducible model training, tools for collaboration, hosting and exchange formats are
covered.
[Figure 4.7: traditional system testing and monitoring comprises unit tests, integration tests and system monitoring on the code and running system; ML-based system testing and monitoring additionally comprises data tests, skew tests, model tests, ML infrastructure tests, prediction monitoring and data monitoring across data, model training and the running system]
Figure 4.8: Continuous delivery for ML end-to-end process (model building, model evaluation and experimentation, productionize model, testing, deployment, monitoring and observability, along the three axes code, model and data)
With regard to model serving, three approaches are proposed. In the first option, the model is
embedded into the consuming application and is deployed as an artifact. A second option is
to deploy a model as a separate service where the model wrapped in a service that can be
deployed independently of the consuming applications. Finally, the model can also be
published independently, but the consuming application will ingest it as streaming data in real
time.
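As an illustration of the second option, a model wrapped in a separate service could, for instance, be exposed through a small web service. The following sketch assumes Flask and a pipeline previously saved with joblib; the file name, route and payload format are illustrative assumptions rather than the approach prescribed by the authors.

```python
# Minimal sketch of serving a model as a separate service (option two above).
# Assumes Flask and a pipeline saved with joblib; names are illustrative.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
pipeline = joblib.load("quality_pipeline.joblib")   # hypothetical artifact

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]       # e.g. [[0.1, 0.2, ...]]
    prediction = pipeline.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```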
The necessity of testing data quality, component integration and model performance as
well as experiment tracking is stressed. Complex scenarios for deploying models such as
shadow deployment are mentioned and briefly explained. In addition to the tests before the
deployment step in the process, monitoring and observability includes checking and
interpreting the model’s inputs, outputs and performance.
Putting ML Models in Production by Kervizic (2019)
The author provides an overview of different approaches to deploying ML models in
production, identifying two important considerations in the form of training and serving. Training
can be executed as one-off training, batch training or real-time/online training. Not updating the
model in production is called one-off training. Batch training, in comparison, describes the
process of releasing a refreshed version of the model based on the latest training run, whereas
continuously updating the model is called online or real-time training. Each training scenario
comes with its own advantages and disadvantages. With regard to serving predictions to
systems wanting to consume the information there are batch predictions and real-time
predictions, which differ in the capability of ingesting live input and thus have implications on
cost and complexity of the computation infrastructure. For batch predictions, the predictions
are served through data exchange formats. Real time predictions can be served through a
database trigger, a pub/sub model, a web-service or even in-app. Each approach is suitable
for different situations and requires different technologies for the implementation.
How to deploy ML Models & Monitoring in Production by Samiullah (2019), (2020)
Deployment of ML models is hard as it combines all the challenges of traditional code with an
additional set of machine learning-specific issues. The first step is to derive an adequate ML
system architecture from business requirements and company goals by specifying the
necessity of real time predictions, model update frequency, data characteristics, regulated
environment and the team’s experience. The author, who also appears as a professor in an
online Udemy course called “Deployment of Machine Learning Models”, proposes four
common architecture patterns (see Table 4.4) which each have their pros and cons and must
be selected according to the specific use case.
Table 4.4: Four potential ML system architecture approaches
Pattern 1 (REST API): training by batch, prediction on the fly, prediction result delivery via REST API
Pattern 2 (Shared DB): training by batch, prediction by batch, prediction result delivery through the shared DB
Pattern 3 (Streaming): training by streaming, prediction by streaming, prediction result delivery via message queue
Pattern 4 (Mobile App): training by streaming, prediction on the fly, prediction result delivery via in-process API on mobile
ML systems need to fulfill some key principles such as reproducibility, which consists of
building a reproducible pipeline covering data gathering, data preprocessing, variable selection
and model building, as well as testing, which is a crucial aspect and may be executed in the
form of differential, benchmark and load/stress tests. Tools for containerization, CI/CD, hosting
platforms and emerging frameworks for managing the ML life cycle are addressed just as the
need for monitoring and alerting.
In a subsequent post, the same author focuses on monitoring ML models once they are
deployed. Monitoring in combination with testing is used to understand the spectrum of ML
risk management with the trade-off between level of confidence in the model’s behavior and
the ease of making adjustments. Data science issues occur regarding the data, whereas
operational issues are linked to the system performance. Observability describes the ability
to comprehend what is happening inside of the system. The use of metrics and logs for
monitoring purposes is explained and illustrated through pseudo code, before closing with a
current overview of the constantly changing software landscape.
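In a similar spirit to the pseudo code referenced above, the following sketch indicates how basic operational and data-related monitoring information could be captured around each prediction using standard logging; the function name, thresholds and metric choices are illustrative assumptions rather than the author's prescription.

```python
# Illustrative sketch of prediction-time monitoring via logs and simple metrics.
# Thresholds and the expected feature range are made-up example values.
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("model_monitoring")

EXPECTED_RANGE = (0.0, 100.0)   # assumed valid range of an input sensor value

def monitored_predict(model, features):
    start = time.perf_counter()
    prediction = model.predict([features])[0]
    latency_ms = (time.perf_counter() - start) * 1000   # operational metric: latency

    # Data-related check: flag inputs outside the expected range (possible drift).
    out_of_range = [x for x in features
                    if not EXPECTED_RANGE[0] <= x <= EXPECTED_RANGE[1]]
    if out_of_range:
        logger.warning("Input values outside expected range: %s", out_of_range)

    logger.info("prediction=%s latency_ms=%.2f", prediction, latency_ms)
    return prediction
```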
How to serve and monitor Models by Akyildiz (2020a), (2020b)
In the blog post by an engineering manager at Facebook, the three most common ways of
serving models are identified. The first pattern is to materialize/compute predictions offline
and serve them through a database. The second architecture consists of using the model within
the main application, so that model serving/deployment can be done with the main application
deployment. The third option is to use the model separately in a microservice architecture to
which inputs are sent and from which outputs are received. Each architecture is
evaluated in detail with regard to criteria such as system set-up effort, maintainability,
scalability and infrastructure complexity, real-time capability, flexibility, and traceability
showing which advantages and disadvantages need to be considered.
Building on serving models, a further blog post by the same author identifies the importance of
monitoring the performance of the ML model, service downtime as well as changes in data
and behavior. The author describes specific aspects that need monitoring and provides
adequate criteria for monitoring such as metrics. In addition, best practices for effective
monitoring are briefly mentioned.
Ultimate Guide to Deploying ML Models by Patruno (2020)
In a series of blog posts, starting with the relevant factors for ML deployment and the
interaction of the end user with the ML model, the author stresses the importance of best
practices. Standardized software interfaces with defined inputs and outputs reduce the
implementation effort. Furthermore, model registries serve to store and track trained ML
models. The next key decision is the selection of the type of inference system. If a batch
inference scheme that precomputes predictions in batch does not fulfill the requirements, real
time predictions can be generated through online inference infrastructure which comes with a
set of challenges. Relevant considerations in the selection process are explained in detail.
Testing is crucial for ML, so the author recommends executing tests for each added function
in the form of test-driven development. Offline testing is done before deployment and focuses
on ML performance metrics. Online validation, also known as experimentation, aims to detect
causality between the deployed ML models and business KPI. At the same time, model
monitoring is used to detect the need for retraining. The extent of testing depends on the
complexity of the application, the business cost of model errors and the resource constraints
of the organization. To illustrate the procedure, each aspect of the guideline is accompanied
by pseudocode of an exemplary use case, and providers of deployment configurations and
tools for ML are referred to. Overall, the process of deploying new model versions is described
as "non-trivial" for reasons such as conflicts between the prototyping team and the deployment
team.
4.4 Theory Deficit
In conclusion, a theory deficit can be derived from the conducted analysis of the state of the art.
Through the evaluation of existing works based on a set of defined criteria, a research gap
concerning the deployment of ML models in production is identified. The existing
approaches, the results of a thorough review of academic and gray literature, do not answer
the research question satisfactorily. Thus, further research on the basis of these findings
is necessary.
5 Outline of the Methodology
In accordance with the defined objective of the thesis and the findings from analyzing not only
the problem in practice but also the state of the art from an investigation-related point of view,
the development of the methodology is subject to preceding considerations and boundary
conditions. These include the requirements which are to be met by the methodology, an
exact definition of the scope, and a contextualization with respect to existing frameworks.
5.1 Requirements
The developed methodology is subject to individual content requirements based on the set
objective of the work. At the same time, generally valid formal requirements for the
development of a methodology apply.
5.1.1 Content Requirements
Answering the research question formulated in chapter 1 while considering the practical
deficit outlined in chapter 2 presupposes that the solution proposal meets some content
requirements. The evaluation criteria used to derive the theoretical deficit in chapter 4 form
the basis for the following content-related requirements:
• Deployment of ML: The methodology focuses on deploying ML models.
• ML for Predictive Quality: The methodology focuses on applications of ML for
predicting the quality in manufacturing processes.
• Strategic Planning: The methodology provides the relevant high-level decisions for a
successful deployment of ML models into production.
• Operational Realization: The methodology presents relevant aspects of the practical
implementation of the deployment process.
• Guideline Structure: The methodology can be used as a guideline with instructional
character which can be followed in order to deploy ML models successfully.
• Transferability: The methodology can be transferred to further use cases
characterized by different requirements.
5.1.2 Formal Requirements
The starting point for defining formal requirements for a methodology is the term itself. Here,
a methodology is described as a “set of methods used in a particular area of study or activity”
(Cambridge Dictionary, 2014). These methods are applied to gain scientific or practical
knowledge with the help of models as representation of real systems. Therefore,
requirements from model and system theory are initially placed on the methodology
(Stachowiak, 1973):
• Representation: The methodology represents the defined observation area.
• Contraction: The methodology simplifies the overall system to the relevant attributes
and elements.
• Pragmatism: The methodology can be applied by a specific user in a target-oriented
way.
In addition, the requirements established by Patzak (1982) in the context of systems
engineering apply:
• Empirical correctness: The methodology is consistent with reality.
• Formal correctness: The methodology is free of contradictions.
• Productivity: The methodology is providing useful answers.
• Manageability: The methodology is easy to apply and interpret.
• Low effort: The methodology’s use is associated with low effort.
5.2 Scope
Due to the huge extent of the topic, clearly defined boundaries of the scope are necessary.
These boundaries concern the programming language, used algorithms, and use case
characteristics.
As a first limitation, only ML models that are developed in the programming language Python
are considered. Python constitutes the quasi-standard for ML, with the main reason for this
being the availability of provided libraries (Subasi, 2020, p. 96). In addition to the libraries
available, there are even more advantages for Python: its clear syntax, easy text
manipulation, its popularity with people and organizations, open-source availability, and
readability through pseudo-code (Harrington, 2012, pp. 13–15). This restriction excludes ML
models written in R or other languages, but it does not mean that the final application must
also be programmed in Python. The applications that provide the model to the user may very
well be built in alternative languages, depending on the device the user chooses for access.
Consequently, the methodology does not exclude languages such as Java or C if they come
into question for the development of the application.
There is an enormous choice of different algorithms for ML models in Python. Therefore, this
thesis focuses on regular ML algorithms and leaves out neural networks, which have their
own individual requirements for the deployment. Deep learning algorithms in combination
with Big Data require much more computing power, resulting in higher cost, a longer
training process and data handling issues due to data size. Furthermore, only those
algorithms are considered which process structured, tabular data and do not receive
unstructured data inputs in form of video, audio or other types of signals. This restriction is
legitimate due to the intended application area which is predictive quality. Sensor data from
production is fed to an ML model that aims to predict the output quality of the process.
The described use case, not only manufacturing in general but predictive quality in particular,
applies to manufacturing companies with high quality standards. If quality is not a
designated business strength, the complex use of ML is not necessarily worth the effort.
Companies for which predictive quality is beneficial have in common that they are
specialized, mostly medium-sized companies with their core competence in production.
Software and IT are seen as a tool to support the production, but not as a core value. Small
companies do not have the manpower in IT or the knowledge of creating and maintaining
complex ML systems. They focus on the fabrication of products and use ML as a tool to
improve the processes, specifically to improve the quality. Consequently, these companies
cannot be compared to big internet firms which use ML as an essential part of their business
model and thus can allocate a considerable amount of resources to the development and
deployment of ML models. For small and medium-sized companies, the factors of cost and
dependency are very important so that non-specialized concepts and tools in form of
open-source solutions are treated primarily. Companies with purely digital business models
may have different requirements and may evaluate concepts and tools differently than those
traditional companies considered here. The choice of tools and concepts has far-reaching
consequences in terms of dependency on third parties and own efforts. Bringing ML to
production must come with a reasonable effort and price. At the same time, they do not want
to be dependent on one specific software solution by one provider only, so open-source
solutions become more relevant. They are free and thus reduce the cost and dependency on
a provider. Measuring and evaluating these efforts financially is not part of the methodology.
Rather, proposed solutions are assessed qualitatively without making concrete quantitative
statements about economic aspects.
5.3 Reference Framework
The methodology for deploying ML models is not developed in isolation as a stand-alone
entity but is inserted into an existing framework, the AutoML pipeline as shown in Figure 5.1.
Figure 5.1: AutoML pipeline in the context of production based on Krauß
Starting with the use case selection, the pipeline provides an overview of the end-to-end
operations needed to enable ML in a production environment and identifies the expertise
needed to execute the processes. It covers the entire ML lifecycle from a macro perspective.
As the first block, data integration treats the process of gathering relevant data from several
different sources. With data residing at many different sources, combining them can prove
itself as a challenging task. In the data preparation block, the dataset is generated. Data
preprocessing operations target the increase in data quality followed by feature engineering,
the pre-computation of features to facilitate the extraction of knowledge by an ML algorithm.
In the modeling block, a model is generated. After choosing an algorithm (algorithm selection),
the algorithm is set up through hyperparameter optimization. In training, the ML algorithm is run
to generate a model. In diagnosis, the focus is on understanding a model’s results from a
domain expert perspective. The modeling steps are executed in several iterations, often
requiring further adjustments in the dataset. Feedback loops may even be necessary between
data preparation and modeling.
The final block is the deployment. The deployment itself is broken down into the
following sub-phases: Deployment design, productionizing & testing, monitoring as well as
retraining. These sub-phases are presented and elaborated in detail in the following chapter.
Based on a successful deployment, certification aspects can be considered.
6 Development of the Methodology
Within the course of this chapter, the results of the development of the methodology are
presented. Figure 6.1 summarizes the core findings for each phase which are subsequently
explained one after another.
Figure 6.1: Overview of methodology
[Figure 6.1 content.
Deployment Design (Chapter 6.1): Prediction Approach (How are predictions made? By batch / in real time), Learning Method (How are models updated? Offline / online), Model Serving (How are models served? Embedded into / separate from the consuming application), Consuming Application (How are predictions consumed? Web app accessible via browser / native app installed on device), Hosting Solution (How is the system hosted? On-premises / cloud).
Productionizing & Testing (Chapter 6.2): Plan (define application requirements), Develop (create application based on requirements), Test (define and execute tests on the application), Release (roll out application to users), Automate (execute steps in an automated manner).
Monitoring (Chapter 6.3): Monitor (Is the system working?), Understand (What is the system doing?), Analyze (How to improve the system?).
Retraining (Chapter 6.4): Trigger (How is the retraining triggered?), Extent (How extensive is the retraining?), Execution (How is the retraining executed?).]
As an own contribution of this thesis, available concepts, either in the form of theoretical
fundamentals or introduced in existing approaches, are analyzed and joined together in order
to form one complete concept covering the whole deployment process from end to end.
Analyzing the state of the art in chapter 4 showed that no existing approach fulfills the
objective of this thesis alone. Consequently, for the development of this methodology,
concepts from different authors and fields of investigation are combined in a structured
procedure which treats all relevant decisions and steps in the phases of the deployment
design, productionizing & testing, monitoring, and retraining. Relevant aspects supporting
these decisions and steps are provided to ensure the instructional character of the developed
methodology.
The deployment design represents a strategic decision which needs to take the use case into
account. Characteristics of predictive quality applications are especially relevant in this
design phase. Subsequent phases of the methodology face more operational issues which
are generally relevant for the deployment of ML models and not specific to a certain kind of
use case. In addition to the four phases, general aspects that overarch the whole deployment
are covered.
6.1 Deployment Design
As an initial task in deployment, called the deployment design, decision owners need to
design the ML system, which then serves as the blueprint for the implementation. The
deployment design is a two-step phase. Firstly, the business needs and restrictions are
translated into technical requirements. Secondly, these technical requirements are used to
define the system architecture. In this way, the most suitable architecture for the given use
case characteristics can be found.
6.1.1 Pre-considerations: Design Requirements
For the technical requirements, there is only a limited number of options. In order to organize
the findings, a morphological box as introduced by Zwicky and Wilson (1967) is applied.
The requirements comprise parameters and the possible values each parameter can
assume. By breaking down the overall problem into attributes, the technique makes it possible
to compress and visually structure the large and disorganized variety of deployment options
and even to create new, previously unseen solutions by combining values. In doing so, the
terminology is harmonized, as different authors use different denominations for similar principles.
Figure 6.2 shows the identified parameters as well as the corresponding technical question
each parameter aims to find an answer for. Possible values for each parameter are also
depicted. By selecting one option for each parameter, the design requirements for the
system architecture are determined. Subsequently, all parameters and the available
solutions are explained focusing on the applicability in the context of production.
Figure 6.2: Morphological box for deployment design (parameters: prediction approach, consuming application, model serving, learning method, and hosting solution, each with two possible values)
Prediction Approach
Predictions can be made by batch or in real time. Batch predictions have a forecast character
as they do not consider real time input. In contrast, real time predictions are calculated at the
exact required moment triggered either by a user request or by the arrival of new data.
The design of the ML system is primarily impacted by the necessity of real time capability
which has implications on the effort and cost associated with the operation. With pre-
calculated predictions by batch, computing can be spread out according to available
capacity. For real time systems, the availability of the service must be ensured during peak
loads including the planning of a failover system. Thus, monitoring and debugging activities
are more complex and time critical resulting in higher cost (Kervizic, 2019).
For predictive quality use cases, the decision for or against a prediction approach depends
mainly on the maximum acceptable waiting time in production. If no real time predictions are needed, a
batch system should be considered as it generates less complexity and requires less
maintenance effort.
Consuming Application
Displaying the predictions of an ML model requires the distinction between web apps and
native apps. For the end user of the predictions, the look and feel of web apps and native
apps might be very similar, but the choice has a great impact on the system architecture. A
web app is an application that is accessible via network by any kind of connected device
without being downloaded onto the device. Native apps, on the other hand, are developed
and installed on a particular device and enable local computation. Web apps have the
advantage of being accessible from multiple different types of devices via browser. Through
a network the predictions are made available for users in different locations. In contrast,
native apps must be installed onto every device, but are optimized for the specific platform
and can run without network connection. Limiting factors for native apps are the required
computing power and the higher development and operation effort in comparison with web
apps. Web apps, in turn, have the disadvantage of not being able to access a device’s built-in features,
as they are developed for cross-platform operation (Bignu, 2019). The decision regarding the
consuming application mainly depends on the type and variety of the used devices in
production.
Running ML models on edge devices such as mobile phones and microcontrollers has
increased in popularity due to a need for on-device data analysis (Konstantinidis, 2020). As
these devices require an application compatible with their host operating system, the
consuming application in case of edge devices corresponds to a native app.
In production environments, both web apps and native apps are used. The option of a web
app should be selected for situations which require access from different devices, easy
usage by non-specialized employees, and universal compatibility. In cases of specialized
devices in production, such as wearable devices or machining tools, native apps are the
logical choice. In addition to displaying the prediction in a web or native app, a notification
about the predicted quality of a product can be sent through an email, a work ticket, or a
visual light on the machine, allowing the responsible agent to initiate remedies.
Model Serving
A key parameter for the architecture is the degree of integration between the calculation and
the consumption of the predictions. One deployment option is to embed the model in the
main application. This includes the possibility of integrating the model into the front end.
Alternatively, a model can be deployed as a separate service. This can be a
webservice in the back end which receives input and returns output. However, the separate service
can also be set up in a streaming manner.
In cases with separate model serving, the predictions need to be delivered to the consuming
application. In accordance with the previously described options, there are mainly three
realizations of result delivery:
• Via database
• Via REST API
• Via streaming
Making predictions available in a database enables a direct database access from the
consuming application. Instead of a database, an API can be used to deliver the results. An
API, short for application programming interface, is not a database or a server but organizes
the access to a webservice (Eising, 2017). REST APIs work according to a request-response
principle, whereas in streaming scenarios a continuous data stream is published to which the
consuming service can subscribe. As an illustrating analogy, a REST API can be compared
to a waiter in a restaurant who takes orders and returns the desired result (Houghton, 2018).
Streaming, on the other hand, can be compared to a newsletter from an online shop. The
consumer subscribes once to a service and then automatically receives the newest data
without having to request it explicitly every time (Björklund, 2017).
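To illustrate the request-response principle from the perspective of the consuming application, the following minimal Python sketch requests a prediction from a separately served model via a REST API; the endpoint URL, the JSON field names, and the feature values are assumptions chosen for illustration only.

    import requests

    # Hypothetical endpoint of the separately served model (assumed for illustration)
    ENDPOINT = "http://ml-service.example.com/predict"

    # Process parameters recorded for one produced part (fictitious feature names)
    payload = {"features": {"temperature": 71.3, "pressure": 2.4, "vibration": 0.08}}

    # Request-response principle: the consuming application asks, the service answers
    response = requests.post(ENDPOINT, json=payload, timeout=5)
    response.raise_for_status()
    print(response.json())  # e.g., {"predicted_quality": "pass"}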
Relevant considerations for the decision regarding an embedded or separate model serving
include the required serving latency and scalability. Calling an external service can increase
waiting time. However, deploying additional ML models to production is easier if models are
served separately and the development and operation of the services is decoupled. A further
crucial aspect is the induced complexity. Streaming model serving requires a complex
system set-up and is only recommended for situations in which a streaming calculation of
predictions in real time is absolutely necessary.
Learning Method
Previous phases, e.g., the selection of an algorithm in the modeling phase, have an influence
on the design of the deployment. As described in chapter 3.2.3, there is online and offline (batch)
learning. The learning method is a relevant parameter for the architectural design as it
defines whether the ability to continuously train a deployed model must be provided. Online learning
implies that all new data points are fed to the model for updating purposes before outputting
a prediction. If the model is learning offline, the training process can be treated separately
from the prediction process.
Similar to the prediction approach, the chosen option has an impact on the system
complexity. Offline training reduces the complexity of the ML system. In contrast, online
learning models are updated as soon as new data is available and therefore require
uninterrupted monitoring of the performance which makes the system more complex to
handle.
Typically, the selection of offline or online learning can be regarded as an input from a
previous phase in the ML life cycle. Already during the modeling, it is decided which kind of
algorithm is used. In big data and production scenarios with a very high volume of data, e.g.,
sensor data from a machine, online learning does not require saving all training data but
allows learning from the incoming data (Hunt, 2017).
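As a minimal sketch of the difference, the following example uses scikit-learn's SGDClassifier, whose partial_fit method allows a deployed model to be updated incrementally with incoming data points instead of being refitted on the complete training history; the two-feature data stream is a fictitious assumption.

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Online learning: the model is updated incrementally as new labeled data arrives
    model = SGDClassifier()
    classes = np.array([0, 1])  # 0 = pass, 1 = fail; must be declared for the first update

    # Simulated stream of (sensor readings, quality label) pairs from production
    stream = [(np.array([71.3, 2.4]), 0), (np.array([75.1, 2.9]), 1)]
    for features, label in stream:
        model.partial_fit(features.reshape(1, -1), [label], classes=classes)

    # The continuously updated model is used for the next prediction
    print(model.predict(np.array([[72.0, 2.5]])))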
Hosting Solution
As a final parameter of the deployment design, on-premises and cloud hosting are available
for selection. The responsibility for managing the whole system is distributed between the
organization itself and a cloud provider through a service level agreement (SLA). Different
cloud service levels are distinguished (see Figure 6.3). On-premises solutions are managed
completely within the organization with no external cloud provider involved. When opting for
a cloud option, a provider can supply an instant computing infrastructure known as
Infrastructure-as-a-Service (IaaS), a complete development and deployment environment in
the cloud called Platform-as-a-Service (PaaS), or a ready-to-use software solution which is
referred to as Software-as-a-Service (SaaS). As it is about hosting the ML system, which is
independent from the data hosting, the responsibility for data always lies within the
organization itself (Chen, 2020).
Figure 6.3: Cloud service levels based on Watts and Raza (2019) and Chen (2020)
An analogy illustrates the different service levels. On-premises is the equivalent of owning a
car. IaaS is like a rental car as the hardware is provided by the car rental company with some
responsibility remaining with the renting person, such as refueling. PaaS can be compared to a taxi:
the customer is not involved in its operation but can still decide on the route. Finally, SaaS
solutions are externally managed and can be compared to a bus with a fixed route (Choo,
2018).
On-premises hosting requires sufficient knowledge and resources to operate the company's own servers
and networks. The more responsibility is given to external providers, the less effort is involved
inside the company to manage hardware and software. A trade-off between ease of use,
cost, potential dependency on an external supplier, and data privacy is required.
Especially the issue of data security poses a main challenge in manufacturing use cases.
Production data is among the most sensitive information of a company, so its security
must be a top priority, which favors on-premises solutions.
6.1.2 Architecture Patterns
Based on the selection of options by means of the morphological box, the system
architecture can be designed. Figure 6.4 shows common architectures in practice. In each
pattern, one option for each of the parameters introduced in chapter 6.1.1 is selected. The
figure also indicates the data flow from the data sources to the consumer of the predictions;
data can be either pushed or pulled to the next element.
[Content of Figure 6.3: for each hosting option (on-premises, IaaS, PaaS, SaaS), the layers data, applications, runtime, middleware, O/S, virtualization, servers, storage, and networking are marked as either self-managed or provider-supplied; the share of provider-supplied layers increases from on-premises towards SaaS.]
Figure 6.4: Common architecture patterns in practice
Data sources are placed in the figure as a generic block with no further specification as the
patterns focus on handling the ML model and not the data. Furthermore, the figure does not
show an exhaustive list. There may be additional but less common patterns as well as
individual architectures through the variation and combination of existing patterns. In the
following, the most common styles are described in more detail.
Shared Database
The first architecture is the one with the lowest complexity. The ML model is handled in a
Python script, which brings the model into the correct format and provides a function to
calculate predictions. These predictions are then saved into a shared database, hence the
name of the pattern. Relevant users can easily access the predictions from a web app or
other applications (Akyildiz, 2020b; Samiullah, 2019).
By delivering the prediction results via a database, which can be an already existing one in
the organization, the complexity of the overall system is kept at low levels. However, the
execution of the script is not triggered in real time by the end user. Rather, it is executed
according to a defined schedule, either manually or automatically through a job scheduler. Due to the lack
of real time capability, the shared database pattern mainly serves as a proof of concept. This
means that it is a good way for an initial deployment to bring results into production. But for
situations requiring good scalability and predictions in real time, it is not the preferred choice
(Samiullah, 2019).
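A minimal sketch of such a script, assuming a serialized scikit-learn model, a CSV export of the latest production data, and an SQLite database standing in for the shared database (all file, table, and column names are illustrative), could look as follows.

    import sqlite3

    import joblib
    import pandas as pd

    # Load the serialized model (assumed file name)
    model = joblib.load("quality_model.joblib")

    # Gather the latest production data; the file is assumed to contain exactly the model's input features
    data = pd.read_csv("latest_production_data.csv")

    # Calculate the predictions by batch
    data["predicted_quality"] = model.predict(data)

    # Save the results into the shared database for the consuming applications
    with sqlite3.connect("shared_predictions.db") as conn:
        data.to_sql("quality_predictions", conn, if_exists="append", index=False)

In practice, such a script would be started by a job scheduler, e.g., cron, according to the defined schedule.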
[Content of Figure 6.4: five common architecture patterns, each combining one value per design parameter and indicating whether data is pushed or pulled between elements]
• Shared Database: prediction by batch; web app; model separate (result delivery via database); offline learning; on-premises hosting; data flows from the data sources into a script containing the ML model and from there into the shared database accessed by the web app.
• In-database: prediction in real time; web app; model separate (result delivery via database); offline learning; cloud hosting; the ML model runs inside the database.
• In-app: prediction in real time; native app; model embedded (result delivery within the app); offline learning; on-premises hosting.
• Webservice: prediction in real time; web app; model separate (result delivery via REST API); offline learning; cloud hosting.
• Streaming: prediction in real time; web app; model separate (result delivery via streaming); online learning; cloud hosting; several model versions (v1, v2) run on the streaming platform.
In-database
By integrating the ML model directly into a database, the complexity increases but
predictions can be made in real time. Data from data sources is saved into the database and
the prediction is directly made. Thus, the person using the prediction can access the
database and retrieve the necessary data. As a limitation of this pattern, only databases with
ML capability can be used and the realization is highly dependent on the provider (Kervizic,
2019).
In-app
A different possibility for designing the architecture is to embed the ML model into a native
app. This pattern is typically used when running the computation on edge devices, which falls
into the category of on-premises hosting. Calculating predictions in-app on a mobile device
has the advantage of not needing any external connection which increases data security.
However, it comes with limitations such as the choice of frameworks for the specific device
and the computing power of the device. Data is not sent to a separate service for prediction
purposes, so that the device itself needs sufficient computation capability (Sato et al., 2019).
A concrete realization of this pattern is the integration of an ML model into the control
software of machine tools if the machine’s manufacturer allows interfering with its software.
As a model update requires installing a new app version on all consuming devices, the
scalability is poor (Kervizic, 2019).
Webservice
A common pattern in practice is to wrap the model in a webservice and deploy it as a
separate service (Akyildiz, 2020b; Pinhasi, 2020). The communication between the web app
and the webservice works in form of a REST API. A good scalability is achieved by using
existing approaches for webservices which are designed for handling high traffic through
measures such as load balancers. Moreover, cloud hosting of the service allows access
from many different devices and locations. The system management difficulty is medium
combined with a good scalability, which makes this pattern the best trade-off
between complexity and performance for many situations (Samiullah, 2019).
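For illustration, the following minimal sketch wraps a serialized model in a small webservice using Flask (one possible framework) with a single REST endpoint; the route, the input format, and the model file name are assumptions, and production concerns such as load balancing, authentication, and input validation are deliberately omitted.

    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("quality_model.joblib")  # serialized pipeline (assumed file name)

    @app.route("/predict", methods=["POST"])
    def predict():
        # The consuming web app sends the numeric feature values of one part as JSON
        features = request.get_json()["features"]
        prediction = model.predict([features])[0]
        return jsonify({"predicted_quality": int(prediction)})

    if __name__ == "__main__":
        app.run(port=5000)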
Streaming
A streaming architecture is also characterized by ML model serving separate from the
consumer, but it follows the push principle. Data streams from production enter the
streaming platform, which then allows training and predicting in real time based on the incoming
data, and the prediction is pushed to the consuming application (Kervizic, 2019; Sato et al.,
2019). As it is hosted separately, good scalability is given even for high volumes of data.
The pattern’s biggest disadvantage is the very high complexity (Samiullah, 2019). Setting up
and running a streaming architecture requires a high level of maturity and effort. Thus, for a
given use case it must be analyzed in detail if the advantages outweigh the disadvantages.
Table 6.1 summarizes the presented patterns by evaluating them regarding scalability and
complexity. As described before, a webservice architecture allows good scalability
combined with medium complexity.
Table 6.1: Evaluation of architectures
              Shared Database   In-database   In-app   Webservice   Streaming
Scalability   Medium            Medium        Poor     Good         Good
Complexity    Low               Medium        Medium   Medium       High
6.2 Productionizing & Testing
In the deployment design, a target architecture for deploying an ML model into production is
defined, which then is to be implemented. Productionizing is understood as a series of
implementation tasks in order to bring a model from a research to a production environment
(Wheeler, 2019). During this transfer, testing represents a crucial aspect (Breck et al., 2017).
Before diving into the implementation steps, pre-considerations about the involved
environments are presented.
6.2.1 Pre-considerations: Environments
An ML model is developed in a research environment and deployed to a production
environment. Both environments are very different from each other. In the research
environment, the model is handled in a notebook by the data scientist. It is separate from
customer-facing software, so that experiments can be easily run. In contrast, the production
environment is live and accessible for the customer. Issues regarding scalability,
reproducibility and infrastructure planning must be considered (Galli & Samiullah, 2021).
Figure 6.5 illustrates the transition from the ML model development in a research
environment to the software development of the application including the ML model. This
application is not directly deployed to production but goes through the typical four tiers of
environments in software development: development, testing, staging and production
(Murray, 2006). The application is developed, then tested with respect to the integration with
other components, released to a pre-production environment awaiting approval before finally
being deployed to a live production environment.
Figure 6.5: Environments for ML model development and ML software development
6.2.2 Implementation Steps
Following the same steps as regular software development, the implementation process in
Figure 6.6 shares commonalities with the DevOps cycle introduced in chapter 3.3. First,
requirements are defined in the planning phase. Then, the application is developed based on
said requirements. Subsequently, tests are defined and executed in order to ensure that the
application is working as planned. Finally, the program is released and rolled out to the user.
These consecutive steps can be automated.
Figure 6.6: Sequence of implementation steps
In comparison to traditional software, the process for deploying ML software is even more
complex. Whereas regular software is subject to changes in the code, ML software needs to
consider the changes in the data and the model additionally (Sato et al., 2019).
Compared to the deployment design with its limited number of parameters,
productionizing & testing is characterized by being a complex process with an unlimited number
of options. A key factor for a successful deployment process is the application of best
practices from software engineering (Serban et al., 2020). Thus, for each step the most
relevant aspects are listed with a focus on ML-specific aspects.
6.2.2.1 Plan
In the plan phase, requirements for the development process are defined. Planning
comprises tasks which are common for any kind of software development project as well as
factors which are only needed for the deployment of ML.
[Content of Figure 6.5: the ML model is developed in the research environment; the resulting ML software then passes through the development, testing, staging, and production environments.]
[Content of Figure 6.6: plan (define application requirements) → develop (create application based on requirements) → test (define and execute tests on the application) → release (roll out application to users); automate (execute steps in an automated manner).]
Project Management
The scope of the development project depends on the specifications of the ML system
defined in deployment design (chapter 6.1). Among further decisions, it primarily includes
specifying the number and type of applications to be developed in accordance with the
selected architecture pattern and the existing IT landscape in the organization.
For managing the project during its execution, frameworks such as Scrum (chapter 3.3) can
be applied. Activities for project management do not belong to ML-specific tasks and, thus,
are not covered in more detail at this point.
ML Functionality
From an ML perspective, the required functionality of the application is to be specified. It
should be defined in an early stage of the project life cycle, either during the business
understanding in CRISP-DM (chapter 3.2.2) or as an input of the stakeholders before the
deployment.
The most basic functionality which must be given in production is the possibility of generating
predictions. Further options include the evaluation of models, which is covered in the
monitoring phase of the deployment (chapter 6.3), or the capability of creating new updated
models, which is described in the retraining phase in chapter 6.4.
For prediction-making and also retraining, data cannot be used as it is but requires
preprocessing as described in the data preparation of the CRISP-DM (chapter 3.2.2). For the
development, it is to be specified which data preparation steps are executed by the ML
application. In many cases, the steps in training do not coincide with the steps for predictions
so that a detailed definition of data preparation functionality is indispensable.
Continuous Data Integration
The ML system relies on the input of data provided by other systems leading to so-called
data dependencies (Sculley et al., 2015). In order to address the data dependencies, it is
necessary to clearly define how the data, which the model needs for making predictions
during the serving phase, is continuously integrated. In production settings, data is typically
saved in databases due to their ease of use and flexibility. These databases reside either on-
premises or in the cloud. Alternatively, data sources can be data streams directly from the
machine or a centralized data warehouse. The goal of this planning task is to ensure that the
ML application is linked correctly to the existing data infrastructure. Therefore, not only do the
possible data sources need to be defined, but also the data input format, e.g., a file or an SQL
query, needs to be specified.
Multiple ML Models
Often not a single model is deployed but multiple models, in the form of duplicate models,
specialized models, stacked models, cascaded models, or competing models (Sato et al.,
2019). Duplicate models are multiple models performing the same task, which makes it possible to
distribute requests between models if the response time of one algorithm is long.
Specialized models each have a different purpose, e.g., one for each product. Stacked
models combine several algorithms into one more powerful predictive model. For
cascaded models, the traffic is routed to an alternative model if the baseline model produces
a prediction with low confidence, or a baseline model makes a first prediction and, based on
it, forwards the task to a specialized model for so-called refinement. Competing
models work through the allocation of data traffic across several competing models to make
the best prediction. In all these cases, incoming data must be guided through the system to
the different models, which may even have different data preparation steps.
6.2.2.2 Develop
Based on the defined requirements, the application is developed. Again, the focus is on
aspects which are especially relevant in the context of ML deployment. General
considerations for any software development project such as best practices are not covered
in detail in this work.
Code Tracking
For version control and collaboration, code changes must be tracked. The two main options
for code tracking are depicted in Figure 6.7. In the GitHub workflow, new features are
developed in separate branches and then merged into the master branch, which is in
production. Alternatively, the GitLab workflow separates the master from the production
branch. This separation is suitable for situations in which it is not possible to deploy every
time a feature branch is merged, e.g., for fixed deployment time windows. In comparison, the
GitHub flow is simple, clean and straightforward and, thus, more suitable for less complex
scenarios. In any case, the selected workflow needs to be aligned with the set-up of
environments and can deviate from the two presented ones.
Figure 6.7: GitHub vs GitLab workflow (GitLab, 2021)
Data and ML Model Versioning
Not only the code is to be tracked but also the data and the ML model. For this purpose,
models are versioned to allow comparability between different versions. At the same time,
each model version is linked to the respective training data in order to trace the data used for
the training of each model. Data preprocessing steps, as part of the data or the model,
require tracking as well. Data and ML model versioning ensure that decisions which were
based on a model’s prediction can be reproduced. Challenges in this context are the high
volume of data to store and a clear definition of how models are versioned (Amershi et al.,
2019).
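As a lightweight sketch of this idea, assuming that no dedicated tool such as DVC or MLflow is used, a model version can be stored together with a small metadata file that records a hash of the training data; the directory and file names are illustrative.

    import hashlib
    import json
    import os
    from datetime import datetime

    import joblib

    def save_versioned_model(model, training_data_path, model_dir="models"):
        """Serialize a model together with metadata tracing the used training data."""
        os.makedirs(model_dir, exist_ok=True)

        # Fingerprint of the exact training data set used for this model version
        with open(training_data_path, "rb") as f:
            data_hash = hashlib.sha256(f.read()).hexdigest()

        version = datetime.now().strftime("%Y%m%d_%H%M%S")
        joblib.dump(model, os.path.join(model_dir, f"model_{version}.joblib"))

        metadata = {"version": version, "training_data": training_data_path, "data_sha256": data_hash}
        with open(os.path.join(model_dir, f"model_{version}.json"), "w") as f:
            json.dump(metadata, f, indent=2)
        return version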
ML Code Structure
Code enables the defined ML functionalities with its structure being dependent on the set-up
of the data preprocessing and possible multiple models. The regular way of structuring the
ML code is procedural programming, which can be seen on the left side of Figure 6.8.
Functions for data handling and calling the algorithm are written separately and are called
one after another. This has the advantage that the code from the notebook in the research
environment, typically a Jupyter Notebook, can be adopted to a large extent. As a downside,
all functions must be debugged separately, which increases the effort. Alternatively, the data
preparation steps and the final algorithm can be joined into one exportable pipeline object. In
comparison to procedural programming, pipelines have a pre-defined structure that must be
complied with. Consequently, the effort for transforming the notebook code into a pipeline
object, which can be custom or provided by third parties (e.g., scikit-learn), is high if pipelines
are not introduced already in the research environments (Galli & Samiullah, 2021).
Figure 6.8: Procedural programming vs pipeline structure
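As an illustration of the pipeline structure, the following sketch joins exemplary data preparation steps and an algorithm into one exportable scikit-learn pipeline object; the chosen preprocessing steps and the algorithm are assumptions and would be replaced by the steps defined in the research environment.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # Data preparation steps and the algorithm are joined into one exportable object
    pipeline = Pipeline(steps=[
        ("impute_missing_values", SimpleImputer(strategy="median")),
        ("scale_features", StandardScaler()),
        ("algorithm", RandomForestClassifier(n_estimators=100, random_state=42)),
    ])

    # The whole pipeline is trained and used like a single model:
    # pipeline.fit(X_train, y_train)
    # predictions = pipeline.predict(X_new)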
ML Model Serialization Format
In order to be used as an object which can be integrated into an application, a model must be
serialized. In other words, the pipeline object or, respectively, the trained algorithm is
transformed into one file, which facilitates versioning. Although the Python standard format is
pickle, there are many more serialization formats. Specialized formats are available for other
types of algorithms or frameworks (Dowling, 2019). In exceptional cases, no serialization is
needed, e.g., for unsupervised learning algorithms that are run through a script.
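A minimal sketch of serialization with the Python standard format pickle could look as follows; the file name is an assumption, and the logistic regression merely stands in for the trained pipeline object.

    import pickle

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()  # stands in for the trained pipeline object (left untrained here for brevity)

    # Transform the trained object into a single file (facilitates versioning)
    with open("quality_model_v1.pkl", "wb") as f:
        pickle.dump(model, f)

    # In the production application, the file is deserialized back into an identical object
    with open("quality_model_v1.pkl", "rb") as f:
        restored_model = pickle.load(f)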
Build Format
Not only the model but the whole application around the model must be made executable
and runnable. With the exception of directly runnable Python scripts, the build format of the
program needs to be specified. Mainly, there are two options to take into account in the form
of packages and containers. Packages are bundles of files that are written for a target
operating system and need to be installed through a package manager to run an application,
mostly on virtual machines. Containers are isolated sandbox environments that contain all
necessary resources to run an application and share the kernel with other applications.
Multiple containers can be managed through an orchestration service (Fagerberg, 2015).
In Figure 6.9 the differences between the options are illustrated. For bare metal, an
application is installed directly on the host operating system. Virtual machines are used to
create multiple separate units all based on the same hardware but with an own guest
operating system (OS). As containers share the same operating system with other containers
and do not have their own guest OS, they are more lightweight but also limited to the host
operating system.
The main advantage of containers is that the application is not installed but already contains
all necessary information to be run. Thus, it is ensured that it is executed correctly on any
host. Due to this strength, the use of containers is increasing in popularity. Especially in
situations with no direct access to the production server, applications can be developed and
tested remotely as containers and then transferred to the deployment infrastructure.
Figure 6.9: Bare metal, virtual machines, and containers based on Kominos et al. (2017)
6.2.2.3 Test
The developed application is subject to thorough testing (Breck et al., 2017). As shown in
Figure 6.10, tests on different levels can be executed regarding the introduced dimensions of
code, model and data.
[Content of Figure 6.9: bare metal runs apps directly on the host operating system; virtual machines each add a guest OS and their own binaries/libraries on top of a hypervisor; containers share the host operating system and run on a container runtime with their own binaries/libraries.]
Figure 6.10: Testing pyramid
Unit tests as the fundamental base for the testing pyramid are used to test components
during the development. One level up, there are integration tests which ensure that multiple
components work together as required. On the top of the pyramid, end-to-end tests validate
the whole application through real user scenarios. For each level, there are many different
tests available out of which Table 6.2 shows the most important ones with respect to the ML
deployment. The main challenge at this point is reproducibility between the research
environment and the production environment (Galli & Samiullah, 2021).
Table 6.2: Tests according to Sato et al. (2019)
Type of test   Artifacts               Test description
Unit           Data                    Data test: Validate data against schema or distributions
Integration    Code and Model          Contract test: Validate that the expected model interface is compatible with the consuming application
Integration    Model and Data          Model quality test: Evaluate model performance through metrics against a performance baseline
                                       Model bias and fairness test: Check performance across different slices of the data
Integration    Code, Model and Data    Consistency test: Validate that the exported model produces the same results as the original one against a validation data set
End-to-end     Code, Model and Data    End-to-end test: Validate the whole application
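As a minimal sketch of a data test on the unit level, the following pytest function validates incoming production data against a simple schema; the column names, the file name, and the value ranges are assumptions for illustration.

    import pandas as pd

    EXPECTED_COLUMNS = ["temperature", "pressure", "vibration"]  # assumed feature schema

    def test_data_matches_schema():
        """Data test: validate incoming data against the expected schema and value ranges."""
        data = pd.read_csv("latest_production_data.csv")  # assumed input file

        # All expected feature columns must be present
        assert set(EXPECTED_COLUMNS).issubset(data.columns)

        # No missing values in the feature columns
        assert not data[EXPECTED_COLUMNS].isnull().any().any()

        # Plausibility check on value ranges (illustrative threshold)
        assert data["temperature"].between(0, 200).all()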
6.2.2.4 Release
When it comes to releasing a tested application and making it accessible in live production, there are
different roll-out strategies available which are generally valid for conventional and ML
software. The roll-out strategy specifies how a live version of the application is substituted
with a newer one. In this scenario, version A is currently active and shall be replaced by the
updated version B. The following strategies can be distinguished (Posta, 2015; Tremel,
2017):
• Recreate: Version A is terminated then version B is rolled out.
• Ramped (also known as rolling-update or incremental): Version B is slowly rolled out,
replacing version A.
• Blue/Green: Version B is released alongside version A, then the traffic is switched to
version B.
• Canary: Version B is released to a subset of users, then rolled out fully.
• A/B testing (not for software release but to test features of the application): Version B
is released to a subset of users under specific conditions.
• Shadow: Version B receives real-world traffic alongside version A and does not
impact the response.
The selected strategy is to be aligned with the number of models and the environment set-up
as for all strategies, except the ramped roll-out, the new version is deployed alongside the
old one.
6.2.2.5 Automate
The automation of the previously described steps promises a gain in efficiency and
deployment speed. Before automating, a manual execution is necessary in order to gain a
deep understanding of the whole process. Automated deployment achieves a reduced
possibility of errors, time savings, consistency, and repeatability (Simek & Slomkova, 2021).
As illustrated in Figure 6.11, different levels of automation can be found that were developed for
conventional software but are equally applicable to ML software (Sato et al., 2019).
Figure 6.11: Degrees of Automation based on Chigira (2019)
6.3 Monitoring
Once a model is released to production, monitoring is a key consideration for ensuring
production-readiness of an ML system (Breck et al., 2017). In the software development
(chapter 3.3), monitoring serves as the last DevOps task, closing the cycle. Like
productionizing & testing, monitoring combines traditional software development with
ML-specific aspects.
6.3.1 Pre-considerations: ML Model Decay
From an ML perspective, all models have in common that they deteriorate over time, with only the
speed of the decay varying (Samuylova, 2020). Models in stable environments may achieve
a constantly high quality over a long period of time; in other cases, the quality decreases
quickly. In any case, the following phenomena cause the ML model decay in the first place.
Data Drift
The data drift describes a change in data distributions (Samuylova, 2020). A shift in the
distribution in the input variables is called covariate shift, whereas a shift in the predicted
output, e.g., the predicted class, is captured under the term of prior probability shift (Stewart,
2019).
There are two scenarios in which drift can occur (Saha & Bose, 2021). Either data
distributions are compared between two different points in time after deployment or between
training and production data. The possible mismatch between data used for training and data
from live production is referred to as training-serving skew (Samuylova, 2020).
[Content of Figure 6.11: in continuous integration (CI), the steps develop, test, push to pre-production stage, and end-to-end test are automated; continuous delivery (CD) adds a manual release to production; in continuous deployment, the release to production is automated as well.]
Concept Drift
The fact that relationships between the model inputs and outputs can change is called
concept drift (Samuylova, 2020). Even if the data distributions remain the same, the model
may not describe the real world as well as before. The change in relationship can be gradual,
sudden, or even seasonal. As an example of gradual concept drift from manufacturing, the
mechanical wear of equipment causes slightly different results under the same process
parameters (Samuylova, 2020).
6.3.2 Monitoring Levels
Monitoring is needed to detect the aforementioned phenomena. Based on Waterworth
(2019), there are three layers to the problem, as shown in Figure 6.12. Starting from the
bottom of the pyramid, it is necessary to understand what a system is doing before being
able to monitor the system and ensure that it is working as planned. Monitoring itself does
not create any value but always requires an analysis of how to improve the system.
Figure 6.12: Levels of monitoring
6.3.2.1 Understand
The goal is observability, which means making the system's behavior observable. There are three
ways of achieving this goal, called the pillars of observability (Sridharan, 2018):
• Metrics
• Logs
• Traces
Metrics
Metrics are a numeric representation of data measured over intervals of time (Sridharan,
2018). Saha and Bose (2021) build a model monitoring metrics stack with three different
types of metrics. Firstly, there are operational metrics for identifying ML system health issues
including latency, memory and CPU usage as well as system uptime. Operational metrics
are independent of both the underlying data and the ML model. The second type of metrics
are performance metrics which only depend on the ML model by measuring its performance
over time. ML-specific metrics are applied which comply with the respective learning task.
Whereas performance metrics make it possible to identify a concept drift, stability metrics as the third
component of the metrics stack aim to detect data drifts. In doing so, stability metrics, e.g.,
the Population Stability Index and the Characteristic Stability Index, depend on both the underlying data
and the ML model.
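As an illustration of a stability metric, the following sketch computes the Population Stability Index between the training distribution and the current production distribution of one feature; the decile-based binning and the interpretation thresholds in the comment are common simplifying assumptions rather than fixed rules.

    import numpy as np

    def population_stability_index(expected, actual, n_bins=10):
        """Compare two samples of one feature; larger values indicate stronger data drift."""
        # Bin edges are derived from the expected (training) distribution
        edges = np.percentile(expected, np.linspace(0, 100, n_bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf

        expected_share = np.histogram(expected, bins=edges)[0] / len(expected)
        actual_share = np.histogram(actual, bins=edges)[0] / len(actual)

        # Avoid taking the logarithm of zero for empty bins
        expected_share = np.clip(expected_share, 1e-6, None)
        actual_share = np.clip(actual_share, 1e-6, None)

        return float(np.sum((actual_share - expected_share) * np.log(actual_share / expected_share)))

    # Common rule of thumb (assumption): below 0.1 stable, 0.1 to 0.25 moderate drift, above 0.25 strong drift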
Logs
An event log (respectively data log) is an immutable, timestamped record of discrete events that
happened over time (Sridharan, 2018). Logs are used to capture events like user access or
errors as well as data which was given to the model as prediction input. From both an
operational and ML-specific view, standardized logging messages facilitate the monitoring
process.
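A minimal sketch of a standardized, timestamped log record for each prediction event could look as follows; the chosen fields are assumptions and would be extended by the relevant identifiers and process parameters of the specific use case.

    import json
    import logging
    from datetime import datetime, timezone

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("ml_service")

    def log_prediction(model_version, features, prediction):
        """Write one immutable, timestamped record of a prediction event."""
        record = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "event": "prediction",
            "model_version": model_version,
            "features": features,
            "prediction": prediction,
        }
        logger.info(json.dumps(record))

    log_prediction("20240101_120000", {"temperature": 71.3, "pressure": 2.4}, "pass")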
Traces
A trace is a representation of a series of causally related distributed events that encode the
end-to-end request flow through a distributed system (Sridharan, 2018). It makes it possible to follow a
signal through the whole system, including all services involved in the request, and to
understand where issues may arise.
Explainability goes one step further than observability and makes the decisions not only
observable but humanly interpretable by the end user (Bhatt et al., 2020). In manufacturing
processes, explainability can increase the trustworthiness of predictions if the model is not
seen as a black box (Goldman et al., 2021). In the AutoML pipeline (Figure 5.1), certification
of the ML model is the very last step. Explainability is one key element for certification, but
due to the complexity of the topic it is not further elaborated here.
6.3.2.2 Monitor
Based on the understanding of the system, the monitoring itself can take place. Two main
approaches for monitoring with diverging purposes exist.
Dashboards
On the one hand, dashboards offer an overview of a system’s state by providing multiple
metrics (Newman, 2016). This high-level overview focuses on metrics, as they aggregate
information, but can also include logs and traces.
Alerts
Alerts, on the other hand, notify a specified recipient of critical conditions of the system
based on certain pre-defined thresholds (Newman, 2016). Thresholds are based on metrics,
but notifications can also be sent in connection with logs and traces.
The way dashboards and alerts are set up depends on the character of the application.
For containerized applications, existing and standardized monitoring solutions are the preferred
choice. Nonetheless, monitoring functionalities can also be integrated into the main application,
which results in additional requirements in the planning phase of the developed application.
6.3.2.3 Analyze
Once a problem is detected, a root cause analysis as depicted in Figure 6.13 makes it possible to
identify whether the problem is of an operational nature. If an operational problem occurred, a step
back to productionizing & testing is made. If the problem is not of a technical nature but caused
by a decrease in the ML model’s performance, the last step of the methodology, the
retraining, is realized.
Figure 6.13: Analysis flow chart
For the purpose of a root cause analysis, all available information in the form of metrics, logs, and
traces is used. Technical problems or operational performance issues are identified through
error logs. Standardized logging messages facilitate the analysis. As this topic is also very
relevant for traditional software, approaches for automated and efficient debugging are to be
considered.
Regarding the ML models, the two described causes for ML model decay, data drift and
concept drift, need to be analyzed using logged data from real-life production. Statistical
testing allows to identify data drift and outliers (Ackerman et al., 2020). Similarly,
sophisticated methods can be applied for the detection of concept drifts (Nishida &
Yamauchi, 2007). As with debugging, a standardized and automated procedure for the
analysis should be strived for.
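For illustration, one possible statistical test is a two-sample Kolmogorov-Smirnov test comparing the training distribution of a feature with the values logged in live production; the generated samples below are fictitious and only stand in for real logged data.

    import numpy as np
    from scipy.stats import ks_2samp

    # Fictitious samples of one input feature at training time and in live production
    training_values = np.random.normal(loc=70, scale=2, size=1000)
    production_values = np.random.normal(loc=73, scale=2, size=1000)

    statistic, p_value = ks_2samp(training_values, production_values)

    # A small p-value indicates that the two distributions differ, i.e., a possible data drift
    if p_value < 0.01:
        print(f"Data drift suspected (KS statistic = {statistic:.3f}, p = {p_value:.4f})")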
6.4 Retraining
In the monitoring step preceding the retraining, it is detected if and when a productionized
model needs to be retrained. Furthermore, the root cause of decreasing model performance is
identified, which serves as an important input for the retraining, as the retraining comprises actions
based on this analysis.
6.4.1 Pre-considerations: Retraining Effect
As a remedy to the unavoidable degradation of model performance over time, models need
to be refreshed. Figure 6.14 shows the decrease of the quality of static models which are not
retrained. Only through retraining can a constantly high model quality be achieved.
Figure 6.14: Impact of refreshing on model quality based on Thomas and Mewald (2019)
6.4.2 Retraining Decisions
Figure 6.15 illustrates the relevant decisions made regarding the retraining. These
interconnected decisions refer to the trigger, extent and execution of retraining (Patruno,
2019).
Figure 6.15: Retraining decisions
Trigger of Retraining
It is to be determined how the retraining is triggered. There are two main approaches
(Patruno, 2019). One option is to retrain a model based on alerts and the subsequent
analysis in the monitoring phase. Alternatively, the moment of retraining follows a fixed
schedule. This periodic retraining is used for recurring events or strong seasonal influences.
As seen in use cases in chapter 3.1.2, production highly depends on seasons due to
changes in parameters such as temperature and humidity.
Online learning models represent a special case as they are retrained continuously. The
decision between online and offline learning is made during deployment design in chapter
6.1.1.
[Content of Figure 6.14: model quality over time decreases for static models, whereas refreshed models maintain a constantly high model quality.]
Extent of Retraining
Primarily, retraining refers to re-building an ML model on a new training data set without
making any changes to the model itself (Patruno, 2019). The pipeline containing the data
preparation steps and an ML algorithm with its hyperparameters stays the same. In the
special case of online learning models, the algorithm is trained with each new data point,
which likewise does not involve changes to the pipeline.
Depending on the conducted root cause analysis, it may be necessary to tune the model rather
than merely feeding the latest data into the existing one (Samuylova, 2020). These adjustments include
changes to features or the selected algorithm and go beyond ingesting new data. Therefore,
this can be referred to as remodeling rather than retraining.
In order to ensure that the new model is improved with respect to the detected data drift or
concept drift, the performance evaluation of the respective set-up is crucial. The selection of
adequate measures is described in chapter 3.2.4.
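A minimal sketch of retraining without remodeling, assuming an existing pipeline object, a new labeled data set, and the F1 score as the previously selected performance measure, re-fits the unchanged pipeline and only promotes the candidate if it does not perform worse than the currently deployed model.

    from sklearn.base import clone
    from sklearn.metrics import f1_score
    from sklearn.model_selection import train_test_split

    def retrain(current_pipeline, X_new, y_new, current_score):
        """Re-build the model on new data without changing the pipeline itself."""
        X_train, X_val, y_train, y_val = train_test_split(
            X_new, y_new, test_size=0.2, random_state=42
        )

        # Data preparation steps and hyperparameters stay the same; only the data changes
        candidate = clone(current_pipeline).fit(X_train, y_train)
        candidate_score = f1_score(y_val, candidate.predict(X_val))

        # Promote the candidate only if it does not fall behind the deployed model
        if candidate_score >= current_score:
            return candidate, candidate_score
        return current_pipeline, current_score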
Execution of Retraining
For executing the retraining, there are two contrasting approaches. In the manual case, activities
for retraining models are executed manually by, e.g., a data scientist. An automated
retraining is especially beneficial if the monitoring is also set up in an automated manner
(Patruno, 2019). AutoML libraries aim to build the whole pipeline from data preparation to
hyperparameter tuning automatically without supervision. AutoML can also be used for
retraining (Kavikondala et al., 2019). Currently, available AutoML tools are not yet mature
and performant enough to fulfill the task satisfactorily (Krauß et al., 2020).
Retraining is the last phase of deployment, but that does not mean the deployment ends at
this point. A retrained model is productionized & tested again followed by monitoring and
ultimately another retraining process.
6.5 General Aspects for Deployment
In parallel to the four deployment phases (deployment design, productionizing & testing,
monitoring, and retraining), there are overarching concepts which represent important factors
for all of the mentioned phases. Specifically, roles and competencies as well as tools and
frameworks are covered in the following.
6.5.1 Roles and Competencies
As presented in chapter 2, a key factor for failed deployments is the coordination between
different stakeholders. Figure 6.16 shows the three involved types of expertise that in
combination allow a successful deployment. There is process (respectively business)
competence in the form of domain knowledge, data science competence, and DevOps
competence (Samiullah, 2019).
Figure 6.16: Collaboration between process, data science and DevOps competence
Data science and DevOps can be analyzed together as it is possible to integrate the two
competence fields into one. Figure 6.17 shows dimensions of data science and DevOps
competencies. These dimensions can be used to evaluate the maturity of an organization.
This maturity evaluation is to be executed beforehand, as a pre-consideration for the deployment, in
the form of a business and competence analysis. Applying the technique of a maturity model is
one way of assessing an enterprise and understanding its current and target states.
Figure 6.17: Maturity model dimensions based on Hornick (2018): roles, data awareness, methodology, strategy, data access, asset management, tools, and scalability
The underlying key to success for deployment is the collaboration between roles and
responsibilities in the same organization, especially between data science and DevOps. Data
science responsibilities comprise all steps of the CRISP-DM from business understanding to
evaluation. For deployment, the information is passed to a DevOps team from a software
engineering background which industrializes the data science project by recoding in another
language, model evaluation and testing, scheduling, monitoring features and deployment
itself (Gherman, 2020).
In order to address the specific challenges of deployment, a new specialized role in form of
an ML engineer has emerged, which is placed between software engineers and data
scientists. Small companies do not have the resources to employ a data science and
DevOps team but rather require one role covering the whole range from data science to
software engineering (Odegua, 2020).
Involved parties during the ML life cycle including the deployment and the model
maintenance can be managed with the tool of a RACI matrix. Stakeholders are classified as
responsible, accountable, consulted or informed. By means of the matrix, the roles existing in
the specific company are clearly distinguished to enable a successful execution of the ML
project ending with the deployment (Wehrstein, 2020).
6.5.2 Tools and Frameworks
When talking about tools and frameworks, a key factor is the decision between open-source
and closed-source solutions. Table 6.3 shows the respective advantages and disadvantages.
Generally, the pros of open-source are the cons of closed-source and vice versa.
Table 6.3: Pros and cons of open-source and closed-source tools (Matteson, 2018)
                Open-source tools             Closed-source tools
Pros            No direct cost                Support by vendor
                High flexibility              Official documentation
                No licensing requirements     Low complexity
                Independence from vendor      Routine updates
Cons            No official support           Cost of service
                Poor documentation            Low flexibility
                High complexity               License schemes
                Slow fixes                    Dependence on vendor
Software and hardware help to effectively deploy ML models. In order to find the best tool for
the task at hand, options must be compared regarding the following factors (Odegua, 2020):
• Efficiency: How efficient is the tool or framework in production? Efficiency refers to
usage of resources like memory, CPU, or time. These factors directly affect the
project performance, reliability, and stability.
• Popularity: How popular is the tool in the developer community? High popularity,
especially of open-source solutions, can indicate that a tool or framework works well
and is actively in use. However, there may be less popular, often proprietary
solutions, that are even more efficient.
• Support: How good is the support for the tool or framework? For open-source solutions, the
availability of resources like tutorials and exemplary use cases provided by the
community defines if good support is given. For proprietary solutions, the support is
evaluated by the service quality of the provider.
Tools and frameworks are applied in all stages of deployment ranging from solutions to
manage the whole ML lifecycle to specialized software for one task. Therefore, in case of
multiple software solutions all components must be compatible with each other. Furthermore,
the experience of the involved team with said solutions is a relevant factor in the decision.
In the scope of this thesis are medium-sized companies in the manufacturing industry. For
this kind of organization, open-source solutions are preferable as they cover a huge variety
of functions that then do not need to be implemented during the deployment.
7 Verification and Validation
In this chapter, the developed methodology is verified and validated in order to assess the
success of the development. Therefore, the methodology is evaluated in the same manner
as existing approaches, implemented for an exemplary use case and discussed in expert
interviews.
According to Balci (1998, p. 336) verification examines the accuracy of transforming a model
from one form into another. Validation, on the other hand, examines if a model behaves with
satisfactory accuracy consistent with the study objectives. In other words, verification is about
building the model right, whereas validation is about building the right model. The approach
by Balci was developed with respect to models and simulation studies. Conceptual models
like the developed methodology, which only have a descriptive structure, cannot be
evaluated with respect to real world behavior and therefore require different methods to
perform verification and validation (Robinson, 2006, p. 796). Rather, conceptual models must
be validated by analyzing if they contain all the necessary details to achieve the goals of the
study (Robinson, 2014, p. 254).
For this purpose, the standards for system, software, and hardware verification and
validation published by the IEEE Computer Society are applied to the developed
methodology. Following the provided definitions, verification describes the process of
evaluating that a system conforms to requirements imposed at the start of the development.
Validation, on the other hand, is defined as the process of providing evidence that the system
satisfies its intended use and user needs.
7.1 Verification
By means of the verification, it is checked if the procedure meets the content-related
requirements (chapter 5.1.1). As these requirements coincide with the criteria which were
used during the analysis of the state of the art in chapter 4, the methodology is evaluated
exactly like existing approaches. Both evaluations are internal processes executed without
the involvement of external parties. Table 7.1 contains the verification results showing the
degree to which the methodology fulfills the previously defined requirements.
Table 7.1: Evaluation of developed methodology
Explanation:
● Completely fulfilled   ◕ Mainly fulfilled   ◑ Partly fulfilled   ◔ Sparsely fulfilled   ○ Not at all fulfilled

Criterion                    Own methodology
Deployment of ML             ●
ML for Predictive Quality    ●
Strategic Planning           ●
Operational Realization      ◑
Guideline                    ●
Transferability              ◕
The developed methodology focuses on the deployment process of ML models into
production, especially for use cases of predictive quality in manufacturing processes.
Strategic aspects are covered in depth providing relevant decisions from a high-level
perspective. However, the operational realization is not treated in equal depth. This is due to
the complexity of the implementation which cannot be covered completely in the scope of
this thesis. Relevant aspects for the implementation are introduced but based on the
provided information more specialized sources need to be consulted. The methodology
serves as a guideline as it illustrates the consecutive steps from end-to-end. It is transferable
to further use cases with the limitation that these use cases fall into the scope of this thesis.
Not all specific predictive quality use cases that might be found in real life, such as image
recognition, are covered.
7.2 Validation
By means of the validation, it is assessed if the described procedure behaves with
satisfactory accuracy in the application. For this purpose, the formal requirements from
chapter 5.1.2 are consulted. The methodology was validated through expert interviews on
the one hand and a practical application on the other.
7.2.1 Expert Interviews
Expert interviews were conducted to check if the methodology represents the defined
observation area, simplifies the overall system to the relevant attributes and elements, is
consistent with reality and is free of contradictions. These criteria cannot be evaluated
without external expertise. Thus, it is necessary to validate the methodology based on the
acceptance of external customers and the suitability for the defined application.
Interviews with Deployment Experts
One-on-one interviews with deployment experts, who have already realized deployments, represent a bottom-up approach for validation. It was analyzed whether existing deployments provided by the experts can be re-built with the methodology. Subsequently, a top-down perspective was taken in the interviews by asking whether the experts could use the methodology for realizing new deployments. Based on the interviews, the methodology was completed by adding missing aspects and resolving discrepancies with their experience.
Workshops with Production Experts
In addition, the methodology was validated in workshops with production experts with the
following structure. First, each phase of the methodology was presented in a separate
workshop. Then, input and feedback by the participants were gathered with respect to
completeness, understandability, and applicability. Based on the participants' comments, the
concept was refined by adding and adjusting content to the expressed needs. As the goal of
the work is to provide a guideline with relevant factors for practice, the input of practitioners is
a valuable source for validation. A description of the applied procedure for validation can be
found in chapter A.1. of the appendix.
7.2.2 Practical Application
Through the practical application, it is evaluated if the methodology can be applied by a
specific user in a target-oriented way, provides useful answers, is easy to apply and interpret, and whether its use is associated with low effort. In the form of a case study, the methodology
is applied to the context of predictive quality. Deploying an ML model in order to predict the
product quality in a production process represents a common use case in the manufacturing
industry, especially for high-tech products with strict quality standards.
A real-life data set from semiconductor manufacturing, made publicly available by the University of California, Irvine at https://archive.ics.uci.edu/ml/datasets/SECOM, was used for the
implementation. Based on an existing performant model, an exemplary deployment was
realized with the help of the developed methodology.
As a first step, the architecture was designed based on the technical requirements described
in the deployment design (chapter 6.1). Predictions are needed in real time and are
consumed through a web app in the browser. The model is embedded into the service to
have only one final application. Given as an input from the model building phase, offline
training is chosen. Finally, the application is hosted locally on-premises. Figure 7.1 illustrates
the architecture setup that was individually defined for the use case.
Figure 7.1: Webservice architecture for use case
Input data is provided in the form of CSV files, which are used by the webservice containing the
model to enable the ML functionality. The service is made accessible in the browser, where
the user has the option of triggering predictions and monitoring model versions. Figure 7.2
shows the home page of the service which is called at 127.0.0.1:5000 in the browser’s
address bar.
Figure 7.2: Screenshot of home page of the webservice
Practically, the deployed application behaves like a regular website, which makes it user-friendly. In the “Prediction” tab, a data file with production data can be
selected and submitted to make a prediction (Figure 7.3). The predictions are then calculated
and presented as indicated in Figure 7.4. Predicted fails are highlighted in red so that the
corresponding worker knows which product requires thorough quality testing. When
accessing the “Monitoring” tab, the active model version is evaluated on a holdout data set
and the respective metrics are presented (Figure 7.5).
(Figure 7.1 components: data input as CSV files; webservice with the embedded ML model providing predictions in real time as a web app, using offline learning and on-premises hosting; user access via browser.)
Figure 7.3: Screenshot of prediction input
Figure 7.4: Screenshot of prediction output
Figure 7.5: Screenshot of monitoring
At this point, an excerpt of the most relevant source code is explained. More source code
including explanations is available in chapter A.2. of the appendix. The implementation of the
application as well as the monitoring and training functionality follows the steps of the
methodology described from chapter 6.2 to 6.4.
In order to access the application through the browser, a local server with Flask is built
through the Python script app.py, which launches the whole application. It defines what is
executed when a certain endpoint (e.g., 127.0.0.1:5000/prediction) is accessed.
First, the necessary imports are made in order to use the required Python libraries. In addition, the predict module of the ML model, located in another folder of the project, is imported.
# imports
import pandas as pd
import os
import joblib
from datetime import datetime
from flask import Flask, request, redirect, url_for, render_template
# import of functionality within the application
import configuration
from ML_model import predict
Then, the app is defined and functions for each endpoint are written.
# definition of the app
app = Flask(__name__)

# standard endpoint
@app.route('/', methods=['GET'])
def home():
    # by accessing the endpoint a GET request is triggered
    if request.method == 'GET':
        # index.html file is returned and displayed
        return render_template("index.html")
When going to the prediction endpoint (GET request), the browser lets the user choose a file.
After the input is sent (POST request), the webservice displays the output in the form of a table
with the prediction results.
# prediction endpoint
@app.route('/prediction', methods=['GET', 'POST'])
def get_prediction():
    if request.method == 'GET':
        # files of production data are listed
        files = os.listdir(configuration.PRODUCTION_DATA_FOLDER)
        # list of files is passed to prediction_input.html
        # html file is returned and displayed
        return render_template("prediction_input.html", list_of_files=files)
    # by pressing the submit button a POST request is made
    if request.method == 'POST':
        # with a POST request the predictions are triggered
        text = "Time and hour of prediction: " + datetime.now().strftime("%d/%m/%Y %H:%M:%S")
        # get the selected option from the dropdown menu
        selected_file = request.form.get("dropdown")
        # check if an option was selected
        if selected_file != '':
            # create empty dataframe
            df = pd.DataFrame()
            # build path of file
            filepath = os.path.join(configuration.PRODUCTION_DATA_FOLDER, selected_file)
            # load input data from selected file
            input_data = pd.read_csv(filepath)
            # fill dataframe with predictions from model
            df = predict.get_prediction_df(input_data)
        # display prediction_output.html to show the predictions as a table
        return render_template("prediction_output.html", pred_to_print=text,
                               table=df.to_html(index=False, header=True, table_id="result_table"))
For the monitoring endpoint, the evaluation metrics are retrieved from the model and
displayed.
# monitoring endpoint
@app.route('/monitoring', methods=['GET'])
def get_evaluation():
    if request.method == 'GET':
        # get version number of model
        version_number = predict.get_version_number()
        # get evaluation metrics scores for model
        scores = predict.get_metrics_scores()
        # display results
        return render_template("monitoring.html", ver=version_number, acc=scores[0],
                               pre=scores[1], rec=scores[2], f1=scores[3])
In order to launch the application with all the aforementioned functions, the main method of
app.py is executed.
# main method
if __name__ == '__main__':
    app.run(debug=False)
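Assuming the service is running locally as described above, its availability could, for example, be checked from a separate Python session. The following snippet is only an illustrative sketch and not part of the implemented application; it requests two of the endpoints defined in app.py and prints the returned HTTP status codes.
# illustrative sketch (not part of the application): check that the running service responds
import requests
for endpoint in ['/', '/monitoring']:
    # request the locally hosted service and print the HTTP status code
    response = requests.get('http://127.0.0.1:5000' + endpoint)
    print(endpoint, response.status_code)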
As stated before, more details on the hands-on realization can be found in the appendix.
8 Conclusion
In this closing chapter, the components of the thesis are reviewed. Furthermore, the
relevance of the work on a social and personal level is highlighted. Finally, an outlook on
future research is given.
Chapter 1 introduced ML as a powerful technology for applications in manufacturing,
especially for predicting quality. Mainly due to a missing standardized procedure, the
deployment of ML models presents a crucial barrier to unfolding the full potential of ML
solutions for businesses. Based on the identification of missing support during the selection
process as the crucial impediment, the goal to develop a methodology for ML model
deployment applied to the context of predictive quality in production was derived.
A more detailed description of the problem in practice was provided in chapter 2. An analysis of the current state of deploying ML models in practice showed evidence of the
unsatisfactory percentage of successful deployments and the associated waste of resources.
Thereupon, the main challenges leading to the failure of many ML projects were identified so
that they can be addressed in the methodology. Due to the topic’s importance and
highlighted room for improvement, the need for further investigation was derived.
Deploying ML models for predictive quality in production requires the combination of
knowledge about quality management, ML, and software engineering so that a basic
understanding of the three areas of investigation is essential. Thus, chapter 3 introduced
relevant theoretical concepts that are necessary to comprehend existing approaches but also
used in the development of the methodology.
In chapter 4, an analysis of the state of the art was conducted. Criteria were defined in
accordance with the set objective to evaluate existing approaches. Both academic and gray
literature was reviewed in order to fully capture the topic. Through the consultation of multiple
sources, the search results were selected and analyzed. As a result, a research gap could be
identified as existing approaches do not serve to fulfill the set objective to a satisfactory
degree, making further research necessary.
Before the elaboration of the methodology itself, it is outlined in chapter 5 by defining the
requirements, narrowing down the scope and establishing the relation to a reference
framework. Precise requirements aim to ensure that the overall objective is fulfilled and need
to be considered before the development of the methodology. Likewise, the research area
must be clearly bounded beforehand. Thereby, the methodology’s structure needs to comply
with a framework which is given as a reference from previous research activities in the field.
The subsequent chapter 6 comprises the development of the methodology covering the
phases deployment design, productionizing & testing, monitoring, and retraining. As a
summary of the developed methodology, Figure 8.1 shows a final overview of the relevant
steps and decisions in each of the four phases. Moreover, roles and competencies as well as tools and frameworks were described as overarching aspects in order to capture the deployment in its entirety.
After the elaboration of the methodology, it was verified and validated through expert
interviews and practical implementation in chapter 7. By means of the evaluation of the
methodology with defined criteria, it was shown that it fulfills the set goal to a satisfactory
degree. The developed conceptual procedure was validated by a hands-on implementation
which can be used as starting point for deploying ML models in organizations.
By means of the validation, it was demonstrated that this thesis contributes to improving the
deployment process of ML models. In predictive quality applications, a successful deployment
leads to an increase in efficiency in production and ultimately to the reduction of cost. In the
context of social responsibility, the work has ethical implications by ensuring the profitability
of production which facilitates the preservation of jobs in the manufacturing industry.
On a personal level, the thesis helped to further develop transversal competencies. An
awareness of contemporary issues was achieved by identifying and interpreting the use of
ML models in the field of industrial engineering and predictive quality in particular. Moreover,
the competence of handling specific instruments relevant for the field of investigation was
enhanced. Data science and ML deployment technologies ranging from the programming
language Python to specific libraries and tools were selected and applied.
To conclude this thesis, the developed methodology can be used as a starting point for
further research. Due to the dynamic nature and extent of ML as an area of investigation, no
concept can claim to be complete and valid for all situations. Therefore, future lines of
research can explore each phase of the methodology in more depth, especially the
integration of DevOps techniques into the ML life cycle. Ultimately, all these techniques aim
to automate the whole end-to-end ML pipeline from data integration to deployment. With
respect to possible applications, it can be investigated in the future how the methodology can
be applied to further use cases within predictive quality but also to companies outside of
the manufacturing industry.
V Bibliography
Ackerman, S., Farchi, E., Raz, O., Zalmanovici, M., & Dube, P. (2020). Detection of data drift
and outliers affecting machine learning model performance over time.
http://arxiv.org/pdf/2012.09258v2
Ackermann, K., Walsh, J., Unánue, A. de, Naveed, H., Navarrete Rivera, A., Lee, S.‑J.,
Bennett, J., Defoe, M., Cody, C., Haynes, L., & Ghani, R. (2018). Deploying machine
learning models for public policy. In Y. Guo & F. Farooq (Eds.), Proceedings of the 24th
acm sigkdd international conference on knowledge discovery & data mining (pp. 15–22).
ACM. https://doi.org/10.1145/3219819.3219911
Agrawal, S., & Mittal, A. (2020). MLOps: 5 Steps to Operationalize Machine Learning
Models: Automate and Productize Machine Learning Algorithms. Informatica.
https://ai4.io/wp-content/uploads/2020/08/2020-08-
07_5f2d921aa925b_MLOps.resources.asset_.faf63486bc68f826d48f086366e9a96d.pdf
Akyildiz, B. (2020a). How to monitor models. https://bugra.github.io/posts/2020/11/24/how-to-
monitor-models/
Akyildiz, B. (2020b). How to serve models. https://bugra.github.io/posts/2020/5/25/how-to-
serve-model/
Algorithmia. (2019). 2020 state of enterprise machine learning.
https://info.algorithmia.com/hubfs/2019/Whitepapers/The-State-of-Enterprise-ML-
2020/Algorithmia_2020_State_of_Enterprise_ML.pdf?utm_campaign=The%20Batch&utm
_source=hs_email&utm_medium=email&_hsenc=p2ANqtz-
9SrICt7U8VAGt4GwFxt47WmEhatriglgLs_5xcaO6b0zG4wsu7No-l5jLL-ypPEck0QMdT
Alpaydin, E. (2014). Introduction to Machine Learning (3rd ed.). Adaptive Computation and
Machine Learning series / Ethem Alpaydin. MIT Press.
Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., &
Zimmermann, T. (2019). Software engineering for machine learning: A case study. In
2019 ieee/acm 41st international conference on software engineering: Software
engineering in practice (icse-seip) (pp. 291–300). IEEE. https://doi.org/10.1109/ICSE-
SEIP.2019.00042
Ariharan, V., Eswaran, S. P., Vempati, S., & Anjum, N. (2019). Machine learning quorum
decider (mlqd) for large scale iot deployments. Procedia Computer Science, 151, 959–
964. https://doi.org/10.1016/j.procs.2019.04.134
Azevedo, A. (2008). Kdd, semma and crisp-dm: a parallel overview. In Iadis European
conference data mining (pp. 182–185).
Baier, L., Jöhren, F., & Seebacher, S. (2019, June 8). Challenges in the deployment and
operation of machine learning in practice. In Proceedings of the 27th European
conference on information systems (ecis), Stockholm & Uppsala, Sweden.
Balci, O. (1998). Verification, validation, and testing. In J. Banks (Ed.), Handbook of
simulation (pp. 335–393). John Wiley & Sons, Inc.
Benington, H. D. (1983). Production of large computer programs. IEEE Annals of the History
of Computing, 5(4), 350–361. https://doi.org/10.1109/MAHC.1983.10102
Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R.,
Moura, J. M. F., & Eckersley, P. (2020). Explainable machine learning in deployment. In
M. Hildebrandt, C. Castillo, E. Celis, S. Ruggieri, L. Taylor, & G. Zanfir-Fortuna (Eds.),
Proceedings of the 2020 conference on fairness, accountability, and transparency
(pp. 648–657). ACM. https://doi.org/10.1145/3351095.3375624
Bignu, A. (2019). Web apps vs native apps: What is the best choice for a data scientist?
https://medium.datadriveninvestor.com/web-apps-vs-native-apps-what-is-the-best-choice-
for-a-data-scientist-3d31169d2335
bigwater.consulting. (2019). Software development life cycle (sdlc). BIG WATER
CONSULTING (BWC). https://bigwater.consulting/2019/04/08/software-development-life-
cycle-sdlc/
Björklund, T. (2017). Apis for non-techies (like myself). https://medium.com/apinf/apis-for-
non-techies-like-myself-259f60042ba
Boehm, B. W. (1988). A spiral model of software development and enhancement. Computer,
21(5), 61–72. https://doi.org/10.1109/2.59
Breck, E., Cai, S., Nielsen, E., Salib, M., & Sculley, D. (2017). The ml test score: A rubric for
ml production readiness and technical debt reduction. In J.-Y. Nie, Z. Obradovic, T.
Suzumura, R. Ghosh, R. Nambiar, & C. Wang (Eds.), 2017 ieee international conference
on big data: dec 11-14, 2017, boston, ma, USA : Proceedings. IEEE.
https://storage.googleapis.com/pub-tools-public-publication-
data/pdf/aad9f93b86b7addfea4c419b9100c6cdd26cacea.pdf
Brosset, P., Patsko, S., & Khadikar, A. (2019). Scaling ai in manufacturing operations: a
practicioner's perspective. Capgemini Research Institute.
Brüning, J., Denkena, B., Dittrich, M.‑A., & Hocke, T. (2017). Machine learning approach for
optimization of automated fiber placement processes. Procedia CIRP, 66, 74–78.
https://doi.org/10.1016/j.procir.2017.03.295
Cambridge Dictionary. (2014). Methodology.
https://dictionary.cambridge.org/de/worterbuch/englisch/methodology
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C. R., & Wirth, R.
(2000). CRISP-DM 1.0: Step-by-step data mining guide. Copenhagen. SPSS.
Chen, J. (2020). Azure fundamental: Iaas, paas, saas. https://medium.com/chenjd-xyz/azure-
fundamental-iaas-paas-saas-973e0c406de7
Chigira, M. (2019). Continuous deployment tools. https://scoutapm.com/blog/continuous-
deployment-tools
Choo, C. (2018). The cloud models: Iaas vs paas vs saas.
https://www.linkedin.com/pulse/cloud-models-iaas-vs-paas-saas-clara-choo
Crankshaw, D., & Gonzalez, J. (2018). Prediction-serving systems. Queue, 16(1), 83–97.
https://doi.org/10.1145/3194653.3210557
Cumbie, B. A., Jourdan, Z., Peachy, T., Dugo, T. M., & Craighead, C. W. (2005). Enterprise
resource planning research: Where are we now and where should we go from here?
Journal of Information Technology Theory and Application (JITTA), Vol. 7(Iss. 2), 21–36.
https://aisel.aisnet.org/jitta/vol7/iss2/4
Debauche, O., Mahmoudi, S., Mahmoudi, S. A., Manneback, P., & Lebeau, F. (2020). A new
edge architecture for ai-iot services deployment. Procedia Computer Science, 175, 10–19.
https://doi.org/10.1016/j.procs.2020.07.006
Decosmo, J. (2019). What nobody tells you about machine learning.
https://www.forbes.com/sites/forbestechcouncil/2019/04/23/what-nobody-tells-you-about-
machine-learning/#50b479f55ac1
Dowling, J. (2019). Guide to file formats for machine learning: Columnar, training,
inferencing, and the feature store. https://towardsdatascience.com/guide-to-file-formats-
for-machine-learning-columnar-training-inferencing-and-the-feature-store-2e0c3d18d4f9
Druzkowski, M. (2017). Building ml models is hard. Deploying them in real business
environments is harder. https://medium.com/ocadotechnology/building-ml-models-is-hard-
deploying-them-in-real-business-environments-is-harder-c2a0433f527
Eising, P. (2017). What exactly is an api? https://medium.com/@perrysetgo/what-exactly-is-
an-api-
69f36968a41f#:~:text=Application%20Programming%20Interface%20(API),to%20commu
nicate%20with%20one%20another.&text=JSON%20or%20XML”.-
,The%20API%20is%20not%20the%20database%20or%20even%20the%20server,that%2
0can%20access%20a%20database.
en.proft.me. (2015). Types of machine learning algorithms.
https://en.proft.me/2015/12/24/types-machine-learning-algorithms/
Escobar, C. A., Morales-Menendez, R., & Macias, D. (2020). Process-monitoring-for-quality
— a machine learning-based modeling for rare event detection. Array, 7, 100034.
https://doi.org/10.1016/j.array.2020.100034
Everett, G. D., & McLeod, R. (2007). Software testing: Testing across the entire software
development life cycle. Wiley-Interscience.
http://www.loc.gov/catdir/enhancements/fy0739/2007001282-b.html
Fagerberg, D. (2015). Container vs package deployments.
https://lastbytes.wordpress.com/2015/09/15/container-vs-package-deployments/
Figalist, I., Elsner, C., Bosch, J., & Olsson, H. H. (2020). An end-to-end framework for
productive use of machine learning in software analytics and business intelligence
solutions. In M. Morisio, M. Torchiano, & A. Jedlitschka (Eds.), Lecture Notes in Computer
Science. Product-Focused Software Process Improvement (Vol. 12562, pp. 217–233).
Springer International Publishing. https://doi.org/10.1007/978-3-030-64148-1_14
Flach, P. (2012). Machine Learning: The Art and Science of Algorithms That Make Sense of
Data. Cambridge University Press. https://doi.org/10.1017/CBO9780511973000
Forsberg, K., & Mooz, H. (1998). System engineering for faster, cheaper, better. Center for
Systems Management, Inc.
https://web.archive.org/web/20030420130303/http://www.incose.org/sfbac/welcome/fcb-
csm.pdf
Frye, M., & Schmitt, R. H. (2019). Quality improvement of milling processes using machine
learning-algorithms. In 16th imeko tc10 conference on testing, diagnostics and inspection
2019: testing, diagnostics and inspection as a comprehensive value chain for quality and
safety, Berlin, Germany.
Galli, S. (2020). How to build and deploy a reproducible machine learning pipeline.
https://trainindata.medium.com/how-to-build-and-deploy-a-reproducible-machine-learning-
pipeline-20119c0ab941
Galli, S., & Samiullah, C. (2021). Deployment of machine learning models. Udemy.
https://www.udemy.com/course/deployment-of-machine-learning-models/
Garousi, V., Felderer, M., & Mäntylä, M. V. (2019). Guidelines for including grey literature
and conducting multivocal literature reviews in software engineering. Information and
Software Technology, 106, 101–121. https://doi.org/10.1016/j.infsof.2018.09.006
Géron, A. (2018). Praxiseinstieg Machine Learning mit Scikit-Learn und TensorFlow:
Konzepte, Tools und Techniken für intelligente Systeme ((K. Rother, Trans.)) (1. Auflage).
O'Reilly. https://www.oreilly.de/buecher/13111/9783960090618-praxiseinstieg-machine-
learning-mit-scikit-learn-und-tensorflow.html
Gherman, A. (2020). Data engineering and data science collaboration processes.
https://towardsdatascience.com/data-engineer-and-data-science-collaboration-processes-
b2d7abcfc74f
Gisselaire, L., Cario, F., Guerre-berthelot, Q., Zigmann, B., Du Bousquet, L., & Nakamura, M.
(2019). Toward evaluation of deployment architecture of ml-based cyber-physical
systems. In 2019 34th ieee/acm international conference on automated software
engineering workshop (asew) (pp. 90–93). IEEE.
https://doi.org/10.1109/ASEW.2019.00036
GitLab. (2021). Introduction to gitlab flow. https://docs.gitlab.com/ee/topics/gitlab_flow.html
Goldman, C. V., Baltaxe, M., Chakraborty, D., & Arinez, J. (2021). Explaining learning
models in manufacturing processes. Procedia Computer Science, 180, 259–268.
https://doi.org/10.1016/j.procs.2021.01.163
Gonfalonieri, A. (2019). Why is machine learning deployment hard?
https://towardsdatascience.com/why-is-machine-learning-deployment-hard-443af67493cd
Halstenberg, J., Pfitzinger, B., & Jestädt, T. (2020). DevOps. Springer Fachmedien
Wiesbaden. https://doi.org/10.1007/978-3-658-31405-7
Harlann, I. (2017). Devops is a culture, not a role! https://neonrocket.medium.com/devops-is-
a-culture-not-a-role-be1bed149b0
Harrington, P. (2012). Machine learning in action. Manning Publications Co.
Hornick, M. (2018). A data science maturity model for enterprise assessment. Oracle.
https://cdn.app.compendium.com/uploads/user/e7c690e8-6ff9-102a-ac6d-
e4aebca50425/2178fa83-87f2-4bdc-a2ff-
384a5382d3bd/File/146aef5f88d7e7f646fb9280c7b5e25f/a_data_science_maturity_model
_for_enterprise_assessment_wp.pdf
Houghton, J. (2018). Understanding what apis are all about. https://medium.com/vody-
techblog/understanding-what-apis-are-all-about-ff2513b76a55
Hunt, X. (2017). Online learning: Machine lerning's secret for big data.
https://blogs.sas.com/content/subconsciousmusings/2017/10/17/online-learning-machine-
learnings-secret-big-data/
IEEE Computer Society. IEEE Standard for System, Software, and Hardware Verification
and Validation. Piscataway, NJ, USA. IEEE.
IEEE Computer Society. IEEE Standard Glossary of Software Engineering Terminology.
Piscataway, NJ, USA. IEEE.
Jeffcock, P. (2018). What's the difference between ai, machine learning, and deep learning?
Oracle. https://blogs.oracle.com/bigdata/difference-ai-machine-learning-deep-learning
John, M. M., Holmström Olsson, H., & Bosch, J. (2021). Architecting ai deployment: A
systematic review of state-of-the-art and state-of-practice literature. In E. Klotins & K.
Wnuk (Eds.), Lecture Notes in Business Information Processing. Software Business (Vol.
407, pp. 14–29). Springer International Publishing. https://doi.org/10.1007/978-3-030-
67292-8_2
Johnson, K. (2019). Ai predictions for 2019. VentureBeat.
https://venturebeat.com/2019/01/02/ai-predictions-for-2019-from-yann-lecun-hilary-
mason-andrew-ng-and-rumman-chowdhury/
Kavikondala, A., Muppalla, V., Prakasha K., K., & Acharya, V. (2019). Automated retraining
of machine learning models. International Journal of Innovative Technology and Exploring
Engineering, 8(12), Article L33221081219, 445–452.
https://doi.org/10.35940/ijitee.L3322.1081219
Keeney, R. L. (1992). Value-focused thinking: a path to creative decision making. Cambridge
Mass.: Harvard University Press.
Keeney, R. L., & Gregory, R. S. (2005). Selecting attributes to measure the achievement of
objectives. Operations Research, 53(1), 1–11. https://doi.org/10.1287/opre.1040.0158
Kervizic, J. (2019). Overview of the different approaches to putting machine learning (ml)
models in production. https://medium.com/analytics-and-data/overview-of-the-different-
approaches-to-putting-machinelearning-ml-models-in-production-c699b34abf86
Kimera, D., & Nangolo, F. N. (2020). Predictive maintenance for ballast pumps on ship repair
yards via machine learning. Transportation Engineering, 2, 100020.
https://doi.org/10.1016/j.treng.2020.100020
Kominos, C. G., Seyvet, N., & Vandikas, K. (2017). Bare-metal, virtual machines and
containers in openstack. In 2017 20th conference on innovations in clouds, internet and
networks (icin) (pp. 36–43). IEEE. https://doi.org/10.1109/ICIN.2017.7899247
Konstantinidis, F. (2020). Why and how to run machine learning algorithms on edge devices.
https://www.therobotreport.com/why-and-how-to-run-machine-learning-algorithms-on-
edge-devices/
Kotu, V., & Deshpande, B. (2019). Data Science (Second Edition). Morgan Kaufmann
Publishers.
Krauß, J. Automl benchmark in production. https://jonathankrauss.github.io/AutoML-
Benchmark/
Krauß, J., Pacheco, B. M., Zang, H. M., & Schmitt, R. H. (2020). Automated machine
learning for predictive quality in production. Procedia CIRP, 93, 443–448.
https://doi.org/10.1016/j.procir.2020.04.039
Larsen, J. (2019). Why do 87% of data science projects never make it into production?
VentureBeat. https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-
never-make-it-into-production/
Lawton, G. (2020). 7 last-mile delivery problems in ai and how to solve them.
https://searchenterpriseai.techtarget.com/feature/7-last-mile-delivery-problems-in-AI-and-
how-to-solve-them
Lehmann, C., Goren Huber, L., Horisberger, T., Scheiba, G., Sima, A. C., & Stockinger, K.
(2020). Big data architecture for intelligent maintenance: A focus on query processing and
machine learning algorithms. Journal of Big Data, 7(1). https://doi.org/10.1186/s40537-
020-00340-7
Lichtenwalter, D., Burggräf, P., Wagner, J., & Weißer, T. (2021). Deep multimodal learning
for manufacturing problem solving. Procedia CIRP, 99, 615–620.
https://doi.org/10.1016/j.procir.2021.03.083
Liu, Y., Ling, Z., Huo, B., Wang, B., Chen, T., & Mouine, E. (2020). Building a platform for
machine learning operations from open source frameworks. IFAC-PapersOnLine, 53(5),
704–709. https://doi.org/10.1016/j.ifacol.2021.04.161
Lwakatare, L. E., Raj, A., Bosch, J., Olsson, H. H., & Crnkovic, I. (2019). A taxonomy of
software engineering challenges for machine learning systems: An empirical investigation.
In P. Kruchten, S. Fraser, & F. Coallier (Eds.), Lecture Notes in Business Information
Processing. Agile Processes in Software Engineering and Extreme Programming (Vol.
355, pp. 227–243). Springer International Publishing. https://doi.org/10.1007/978-3-030-
19034-7_14
Matteson, S. (2018). How to decide if open source or proprietary software solutions are best
for your business. https://www.techrepublic.com/article/how-to-decide-if-open-source-or-
proprietary-software-solutions-are-best-for-your-business/
Mehta, P., Butkewitsch-Choze, S., & Seaman, C. (2018). Smart manufacturing analytics
application for semi-continuous manufacturing process – a use case. Procedia
Manufacturing, 26, 1041–1052. https://doi.org/10.1016/j.promfg.2018.07.138
Mitchell, T. M. (2010). Machine learning (International ed. [Reprint.]. McGraw-Hill series in
computer science. McGraw-Hill.
Mobley, R. K. (2002). An introduction to predictive maintenance (2. ed.). Butterworth-
Heinemann. http://www.loc.gov/catdir/description/els031/2001056670.html
Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of machine learning
(Second edition). Adaptive computation and machine learning. MIT Press.
Murray, P. E. (2006). Traditional development/integration/staging/production practice for
software development. https://dltj.org/article/software-development-practice/
Muthusamy, V., Slominski, A., & Ishakian, V. (2018). Towards enterprise-ready ai
deployments minimizing the risk of consuming ai models in business applications. In 2018
first international conference on artificial intelligence for industries (ai4i) (pp. 108–109).
IEEE. https://doi.org/10.1109/AI4I.2018.8665685
Nalbach, O., Linn, C., Derouet, M., & Werth, D. (2018). Predictive quality: Towards a new
understanding of quality assurance using machine learning tools. In W. Abramowicz & A.
Paschke (Eds.), Lecture Notes in Business Information Processing. Business Information
Systems (Vol. 320, pp. 30–42). Springer International Publishing.
https://doi.org/10.1007/978-3-319-93931-5_3
National Research Council. (1995). Unit Manufacturing Processes. National Academies
Press. https://doi.org/10.17226/4827
Neal Analytics. (2020). Machine learning operations (mlops). https://nealanalytics.com/wp-
content/uploads/2020/07/MLOps-Datasheet.pdf
Newman, A. (2016). How to use dashboards and alerts for data monitoring.
https://www.loggly.com/blog/how-to-use-dashboards-and-alerts-for-data-monitoring/
Ngo, Q. H., & Schmitt, R. H. (2016). A data-based approach for quality regulation. Procedia
CIRP, 57, 498–503. https://doi.org/10.1016/j.procir.2016.11.086
Nishida, K., & Yamauchi, K. (2007). Detecting concept drift using statistical testing. In V.
Corruble, M. Takeda, & E. Suzuki (Eds.), Lecture Notes in Computer Science. Discovery
Science (Vol. 4755, pp. 264–269). Springer Berlin Heidelberg. https://doi.org/10.1007/978-
3-540-75488-6_27
Odegua, R. (2020). How to put machine learning models into production.
https://stackoverflow.blog/2020/10/12/how-to-put-machine-learning-models-into-
production/
Oxford University Press. (2020). Definition of deployment.
https://www.lexico.com/definition/deployment
Pääkkönen, P., & Pakkala, D. (2020). Extending reference architecture of big data systems
towards machine learning in edge computing environments. Journal of Big Data, 7(1).
https://doi.org/10.1186/s40537-020-00303-y
Patruno, L. (2019). The ultimate guide to model retraining. https://mlinproduction.com/model-
retraining/
Patruno, L. (2020). The ultimate guide to deploying machine learning models. ML in
Production. https://mlinproduction.com/deploying-machine-learning-models/;
https://mlinproduction.com/what-does-it-mean-to-deploy-a-machine-learning-model-
deployment-series-01/; https://mlinproduction.com/software-interfaces-for-machine-
learning-deployment-deployment-series-02/; https://mlinproduction.com/batch-inference-
for-machine-learning-deployment-deployment-series-03/; https://mlinproduction.com/the-
challenges-of-online-inference-deployment-series-04/; https://mlinproduction.com/online-
inference-for-ml-deployment-deployment-series-05/; https://mlinproduction.com/model-
registries-for-ml-deployment-deployment-series-06/; https://mlinproduction.com/testing-
machine-learning-models-deployment-series-07/; https://mlinproduction.com/ab-test-ml-
models-deployment-series-08/
Patzak, G. (1982). Systemtechnik - Planung komplexer innovativer Systeme: Grundlagen,
Methoden, Techniken. Springer.
Pennington, J. (2019). The eight phases of a devops pipeline.
https://medium.com/taptuit/the-eight-phases-of-a-devops-pipeline-fda53ec9bba
Perrault, R., Shoham, Y., Brynjolfsson, E., Clark, J., Etchemendy, J., Grosz, B., Lyons, T., &
Manyika, J. (2019). The ai index 2019 annual report. AI Index Steering Committee,
Human-Centered AI Institute.
Pilarski, S., Staniszewski, M., Bryan, M., Villeneuve, F., & Varró, D. (2021). Predictions-on-
chip: Model-based training and automated deployment of machine learning models at
runtime. Software and Systems Modeling. Advance online publication.
https://doi.org/10.1007/s10270-020-00856-9
Pinhasi, A. (2020). Deploying machine learning models to production — inference service
architecture patterns. https://medium.com/data-for-ai/deploying-machine-learning-models-
to-production-inference-service-architecture-patterns-bc8051f70080
Posta, C. (2015). Blue-green deployments, a/b testing, and canary releases.
https://blog.christianposta.com/deploy/blue-green-deployments-a-b-testing-and-canary-
releases/
Quintanilla, L., Schonning, N., Kershaw, N., Victor, Y., Wenzel, M., Pratschner, S.,
Potapenko, M., Gronlund, C. J., Alexander, J., Kulikov, P., & Dugar, A. (2019). Machine
learning tasks in ml.Net. Microsoft. https://docs.microsoft.com/en-us/dotnet/machine-
learning/resources/tasks
Rao, A., Likens, S., & Shehab, M. (2019). 2019 ai predictions: six ai priorities you can’t afford
to ignore. PwC US. https://www.pwc.com/us/en/services/consulting/library/artificial-
intelligence-predictions-2019.html
Robinson, S. (2006). Conceptual modeling for simulation: Issues and research requirements.
In Proceedings of the 2006 winter simulation conference (pp. 792–800). IEEE.
https://doi.org/10.1109/WSC.2006.323160
Robinson, S. (2014). Simulation: The practice of model development and use (2nd edition).
Palgrave Macmillan.
Rodríguez, M. Á., Alemany, M. M. E., Boza, A., Cuenca, L., & Ortiz, Á. (2020). Artificial
intelligence in supply chain operations planning: Collaboration and digital perspectives. In
L. M. Camarinha-Matos, H. Afsarmanesh, & A. Ortiz (Eds.), IFIP Advances in Information
and Communication Technology. Boosting Collaborative Networks 4.0 (Vol. 598, pp. 365–
378). Springer International Publishing. https://doi.org/10.1007/978-3-030-62412-5_30
Royce, W. W. (1970). Managing the development of large software systems: Concepts and
techniques. Proc. IEEE WESTCON, Los Angeles, 1–9.
Rychener, L., Montet, F., & Hennebert, J. (2020). Architecture proposal for machine learning
based industrial process monitoring. Procedia Computer Science, 170, 648–655.
https://doi.org/10.1016/j.procs.2020.03.137
Saha, P., & Bose, A. (2021). Mlops: Model monitoring 101.
https://www.kdnuggets.com/2021/01/mlops-model-monitoring-101.html
Salminen, J., Milenković, M., & Jansen, B. J. (2017). Problems of data science in
organizations: An explorative qualitative analysis of business professionals’ concerns. In
E. Y. Li & K. N. Shen (Chairs), The 17th International Conference on Electronic Business,
Dubai.
Saltz, J. (2020). Crisp-dm is still the most popular framework for executing data science
projects. https://www.datascience-pm.com/crisp-dm-still-most-popular/
Samiullah, C. (2019). How to deploy machine learning models.
https://christophergs.com/machine%20learning/2019/03/17/how-to-deploy-machine-
learning-models/
Samiullah, C. (2020). Monitoring machine learning models in production.
https://christophergs.com/machine%20learning/2020/03/14/how-to-monitor-machine-
learning-models/
Samuylova, E. (2020). Machine learning in production: Why you should care about data and
concept drift. https://towardsdatascience.com/machine-learning-in-production-why-you-
should-care-about-data-and-concept-drift-d96d0bc907fb
Sarkar, D., Bali, R., & Sharma, T. (2018). Practical machine learning with Python: A problem-
solver's guide to building real-world intelligent systems. Apress.
Sato, D., Wider, A., & Windheuser, C. (2019). Continuous delivery for machine learning:
automating the end-to-end lifecycle of machine learning applications.
https://martinfowler.com/articles/cd4ml.html
Schmitt, J., Bönig, J., Borggräfe, T., Beitinger, G., & Deuse, J. (2020). Predictive model-
based quality inspection using machine learning and edge cloud computing. Advanced
Engineering Informatics, 45, 101101. https://doi.org/10.1016/j.aei.2020.101101
Schmitt, R. H., Kurzhals, R., Ellerich, M., Nilgen, G., Schlegel, P., Dietrich, E., & Krauß, J.
(2020). Predictive quality – data analytics in produzierenden unternehmen. In Internet of
production - turning data into value (pp. 226–253).
Schorr, S., Möller, M., Heib, J., Fang, S., & Bähre, D. (2020). Quality prediction of reamed
bores based on process data and machine learning algorithm: A contribution to a more
sustainable manufacturing. Procedia Manufacturing, 43, 519–526.
https://doi.org/10.1016/j.promfg.2020.02.180
Schwaber, K., & Sutherland, J. (2020). The scrum guide: the definitive guide to scrum: The
rules of the game. https://scrumguides.org/docs/scrumguide/v2020/2020-Scrum-Guide-
US.pdf#zoom=100
Scrum.org. (2020). What is scrum? https://www.scrum.org/resources/what-is-scrum
Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V.,
Young, M., & Dennison, D. (2015). Hidden technical debt in machine learning systems. In
C. Cortes, D. D. Lee, M. Sugiyama, & R. Gernett (Eds.), Proceedings of the 28th
international conference on neural information processing systems (2nd ed., pp. 2503–
2511). MIT Press, Cambridge, MA, USA.
https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf
Serban, A., van der Blom, K., Hoos, H., & Visser, J. (2020). Adoption and effects of software
engineering best practices in machine learning. In Proceedings of the 14th acm / ieee
international symposium on empirical software engineering and measurement (esem)
(pp. 1–12). ACM. https://doi.org/10.1145/3382494.3410681
Shaik, N. (2019). Unpacking the complexity of machine learning deployments.
https://predera.com/unpacking-the-complexity-of-machine-learning-deployments/
Shalev-Shwartz, S., & Ben-David, S. (2019). Understanding machine learning: From theory
to algorithms (12th printing). Cambridge University Press.
Shrivastava, T. (2016). 8 reasons why analytics / machine learning models fail to get
deployed. https://www.analyticsvidhya.com/blog/2016/05/8-reasons-analytics-machine-
learning-models-fail-deployed/
Simek, P., & Slomkova, K. (2021). Automated deployment.
https://developerexperience.io/practices/automated-deployment
Singh, P. (2021). Deploy Machine Learning Models to Production. Apress.
https://doi.org/10.1007/978-1-4842-6546-8
Singh Bisen, V. (2019). These are the reasons why more than 95% ai and ml projects fail.
https://medium.com/vsinghbisen/these-are-the-reasons-why-more-than-95-ai-and-ml-
projects-fail-cd97f4484ecc
Sridharan, C. (2018). Distributed Systems Observability: A Guide to Building Robust
Systems. O’Reilly Media. https://unlimited.humio.com/rs/756-LMY-106/images/Distributed-
Systems-Observability-eBook.pdf
Stachowiak, H. (1973). Allgemeine Modelltheorie. Springer.
Stewart, M. (2019). Understanding dataset shift.
https://towardsdatascience.com/understanding-dataset-shift-f2a5a262a766
Subasi, A. (2020). Practical machine learning for data analysis using Python. Elsevier;
Academic Press.
Svetashova, Y., Zhou, B., Pychynski, T., Schmidt, S., Sure-Vetter, Y., Mikut, R., &
Kharlamov, E. (2020). Ontology-enhanced machine learning: A bosch use case of welding
quality monitoring. In J. Z. Pan, V. Tamma, C. d’Amato, K. Janowicz, B. Fu, A. Polleres,
O. Seneviratne, & L. Kagal (Eds.), Lecture Notes in Computer Science. The Semantic
Web – ISWC 2020 (Vol. 12507, pp. 531–550). Springer International Publishing.
https://doi.org/10.1007/978-3-030-62466-8_33
Talby, D. (2019). Why machine learning models crash and burn in production.
https://www.forbes.com/sites/forbestechcouncil/2019/04/03/why-machine-learning-
models-crash-and-burn-in-production/#64b9b84c2f43
Thomas, J., & Mewald, C. (2019). Productionizing machine learning: From deployment to
drift detection. https://slacker.ro/2019/09/18/productionizing-machine-learning-from-
deployment-to-drift-detection/
Tremel, E. (2017). Six strategies for application deployment.
https://thenewstack.io/deployment-strategies/
Treveil, M., & Dataiku Team. (2020). Introducing MLOps. O'Reilly Media, Inc.
Turck, M. (2020). Resilience and vibrancy: The 2020 data & ai landscape.
https://mattturck.com/data2020/
Turetskyy, A., Wessel, J., Herrmann, C., & Thiede, S. (2021). Battery production design
using multi-output machine learning models. Energy Storage Materials, 38, 93–112.
https://doi.org/10.1016/j.ensm.2021.03.002
Ulrich, H., Dyllick, T., & Probst, G. (1984). Management. Haupt.
Vafeiadis, T., Ioannidis, D., Ziazios, C., Metaxa, I. N., & Tzovaras, D. (2017). Towards robust
early stage data knowledge-based inference engine to support zero-defect strategies in
manufacturing environment. Procedia Manufacturing, 11, 679–685.
https://doi.org/10.1016/j.promfg.2017.07.167
Washizaki, H., Uchida, H., Khomh, F., & Gueheneuc, Y.‑G. (2019). Studying software
engineering patterns for designing machine learning systems. In 2019 10th international
workshop on empirical software engineering in practice (iwesep) (pp. 49–495). IEEE.
https://doi.org/10.1109/IWESEP49350.2019.00017
Waterworth, S. (2019). Observability vs. Monitoring.
https://www.instana.com/blog/observability-vs-monitoring/
Watts, S., & Raza, M. (2019). Saas vs paas vs iaas: What’s the difference & how to choose.
https://www.bmc.com/blogs/saas-vs-paas-vs-iaas-whats-the-difference-and-how-to-
choose/
Wehrstein, L. (2020). Crisp-dm ready for machine learning projects.
https://towardsdatascience.com/crisp-dm-ready-for-machine-learning-projects-
2aad9172056a
Wheeler, S. (2019). What does it mean to “productionize” data science?
https://towardsdatascience.com/what-does-it-mean-to-productionize-data-science-
82e2e78f044c
Wohlin, C. (2014). Guidelines for snowballing in systematic literature studies and a
replication in software engineering. In M. Shepperd, T. Hall, & I. Myrtveit (Eds.),
Proceedings of the 18th international conference on evaluation and assessment in
software engineering - ease '14 (pp. 1–10). ACM Press.
https://doi.org/10.1145/2601248.2601268
Yong, B. X., & Brintrup, A. (2020). Multi agent system for machine learning under uncertainty
in cyber physical manufacturing system. In T. Borangiu, D. Trentesaux, P. Leitão, A. Giret
Boggino, & V. Botti (Eds.), Studies in Computational Intelligence. Service Oriented,
Holonic and Multi-agent Manufacturing Systems for Industry of the Future (Vol. 853,
pp. 244–257). Springer International Publishing. https://doi.org/10.1007/978-3-030-27477-
1_19
Zahrani, E. G., Hojati, F., Daneshi, A., Azarhoushang, B., & Wilde, J. (2020). Application of
machine learning to predict the product quality and geometry in circular laser grooving
process. Procedia CIRP, 94, 474–480. https://doi.org/10.1016/j.procir.2020.09.167
Zeiser, A., van Stein, B., & Bäck, T. (2021). Requirements towards optimizing analytics in
industrial processes. Procedia Computer Science, 184, 597–605.
https://doi.org/10.1016/j.procs.2021.03.074
Zinkevich, M. (2016). Rules of Machine Learning: Best Practices for ML Engineering.
http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf
Zwicky, F., & Wilson, A. G. (Eds.) (1967). New methods of thought and procedure. Springer-
Verlag.
Icons made by Freepik from www.flaticon.com.
VI Budgeting
As an addition to the written document, this section covers considerations regarding the
budget when developing and executing a concrete project to deploy an ML model for
predictive quality into production.
Cost of Development of Project
With respect to developing the project, all necessary activities for creating this document
were executed by the author himself. Figure 1 shows the time plan for the whole thesis as
well as the proportions of the activities. An analysis of the current state and problem in
practice comprises 10 % of the total effort. Introducing theoretical fundamentals and evaluating existing approaches account for 30 % of the effective working time, while outlining and developing the methodology takes up 40 %. The remaining 20 % are used to validate the methodology and implement a use case.
Figure 1: Time plan of thesis
The only cost associated with the development of the project is the labor of the author.
Applying an hourly wage of 14 €/h valid for students with a completed bachelor’s degree, the
time plan can be translated into costs as indicated in Table 1. For the total of 300 effective
working hours, the total cost for development of the project is 4200 €.
Cost of Execution of Project
When executing the project and deploying an ML model for predictive quality in production,
there are costs for developing and operating the ML software. Depending on the size of the
company, estimated costs are listed in Table 2.
By summing up the development and execution costs, the total project cost is determined. In
the next step, the total cost is compared to the estimated benefits.
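For example, combining the development cost of 4200 € (Table 1) with the execution cost of 600.000 € estimated for a large company (Table 2) yields a total project cost of 604.200 €.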
(Figure 1 depicts the time plan from January to July 2021 for the four activities listed in Table 1.)
Table 1: Estimated costs for development of project

Activity | Effective hours | Associated cost
Analysis of current state and problem in practice | 30 h | 420 €
Theoretical fundamentals and evaluation of existing approaches | 90 h | 1260 €
Outline and development of methodology | 120 h | 1680 €
Validation and implementation of use case | 60 h | 840 €
Total | 300 h | 4200 €

Table 2: Estimated costs for execution of project

Cost factors | Small company | Large company
ML software development (~ 25 % of total cost) | 35.000 € | 150.000 €
ML software operation (~ 75 % of total cost) | 105.000 € | 450.000 €
Total | 140.000 € | 600.000 €

Sources: https://www.spheregen.com/cost-of-software-development/,
https://www.lookfar.com/blog/2016/10/21/software-maintenance-understanding-and-estimating-costs/
Benefit of Execution of Project
The value that can be created through the deployment of ML models for predictive quality
highly depends on the respective use case. A real-life case study from the production of
bladed disks (BLISKs), which are important components of turbines such as aircraft jet
engines, serves as an example. Predictive quality measures are estimated to bring the high
rework rate of 25 % down to 15 %. In the case study, annual savings in production of
27.000.000 € are estimated as shown in Table 3.
Table 3: Estimated savings in BLISK manufacturing

Cost reduction per item | Average number of items per day and factory | Annual savings per factory
3600 € | 40 | 27.000.000 €

Source: https://www.ericsson.com/en/reports-and-papers/consumerlab/reports/5g-business-value-to-industry-blisk
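As a rough plausibility check (the number of production days per year is not stated in the cited source and is therefore an assumption): 3600 € × 40 items correspond to 144.000 € of savings per day, so annual savings of 27.000.000 € imply on the order of 190 production days per factory and year.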
Advantageousness of Project
As an important note, this cost reduction is not achieved solely by the deployment. Deploying
a model only represents the last step and is preceded by activities such as preparing the production equipment and building the model itself. In order to realize the savings, the cost of
employing data scientists to build the model and necessary investments in devices for data
acquisition such as sensors must be taken into account. Consequently, the direct economic impact of the deployment is very difficult to isolate, even though the success of the predictive quality implementation depends on it.
In the presented example, comparing the magnitude of yearly savings of 27.000.000 € with the total cost of the project amounting to 604.200 € provides evidence for the advantageousness of the project. For other use cases, a detailed analysis of the costs of the whole life cycle of an ML application, including phases before the deployment, needs to be considered in the decision.
VII Appendix
Through workshops and the practical implementation, the methodology was validated. In this
appendix, the procedure applied in the workshops is presented and relevant elements of the
source code of the implemented software are explained.
A.1. Workshops
For the workshops with production experts, the tool Miro was used, which allows the participants to collaborate on a shared whiteboard in the browser. The participants'
input was collected by asking the questions below. Answers were written on the colored
notes and then posted directly to the corresponding section of the methodology.
A.2. Source Code
With respect to the source code, first the directory tree of the application folder is shown as
an overview. Then, the source code of the application’s most relevant components is
presented. The shown excerpts are responsible for enabling the main functionality of the
programmed service.
A.2.1. Directory Tree
The directory tree contains the folders with different functionalities. All elements responsible
for running the application locally on the computer and displaying it in the desired design in
the browser are located in the API folder. Data used for predictions, monitoring, and training are saved in the Data folder. Lastly, the ML_model folder contains all scripts and objects to
create a model and make predictions with it.
Application
| configuration.py
|
+---API
| | app.py
| |
| +---static
| | +---css
| | | style.css
| | |
| | \---images
| | favicon-16x16.png
| | favicon-32x32.png
| | favicon.ico
| |
| \---templates
| index.html
| layout.html
| monitoring.html
| prediction_input.html
| prediction_output.html
|
+---Data
| +---Production
| | 2008-01-08.csv
| | 2008-01-09.csv
| | …
| | 2008-12-09.csv
| | 2008-12-10.csv
| |
| \---Training
| uci-secom.csv
|
\---ML_model
| build_pipeline.py
| predict.py
| preprocessors.py
| train.py
|
\---Objects
constant_columns.pkl
mostly_empty_columns.pkl
pca.pkl
rfe.pkl
scaler.pkl
trained_model.pkl
trained_pipeline.pkl
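The configuration.py script at the root of the application is referenced by app.py and the ML model scripts but is not reproduced in this excerpt. A minimal sketch of its presumable content is shown below; the exact paths and the absence of further settings are assumptions derived from the directory tree above.
# configuration.py (illustrative sketch; exact contents are an assumption)
import os
# root directory of the application
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
# folder with the production data used for predictions
PRODUCTION_DATA_FOLDER = os.path.join(BASE_DIR, 'Data', 'Production')
# file with the training data
TRAINING_DATA_FILE = os.path.join(BASE_DIR, 'Data', 'Training', 'uci-secom.csv')
# folder in which the trained objects are stored
OBJECT_FOLDER = os.path.join(BASE_DIR, 'ML_model', 'Objects')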
A.2.2. API
app.py
The source code of the app.py script was already shown in chapter 7.2.2.
layout.html
In app.py, a request is answered by returning an HTML file. All HTML files are based on the
same template which is defined in layout.html. This file defines the look and feel of the
application by arranging and designing the shown objects.
<!DOCTYPE html>
<html>
<!-- Head -->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- title of web page -->
<title>SECOM Deployment</title>
<!-- icons -->
<link rel="icon" href="{{url_for('static', filename='images/favicon-32x32.png')}}" sizes=32x32>
<link rel="icon" href="{{url_for('static', filename='images/favicon-16x16.png')}}" sizes=16x16>
<!-- css stylesheet for design of elements -->
<link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
</head>
<!-- Body -->
<body>
<header class="header-basic">
<div class="header-limiter">
<!-- logo in the left corner can be used to get to the home page -->
<h1><a href="{{ url_for('.home') }}">SECOM Deployment</a></h1>
<!-- links to other endlinks on the right side of the page -->
<nav>
<a href="{{ url_for('.home') }}">Home</a>
<a href="{{ url_for('.get_prediction') }}">Prediction</a>
<a href="{{ url_for('.get_evaluation') }}">Monitoring</a>
<!--
<a href="#">tbd</a>
-->
</nav>
</div>
</header>
<div class="container">
{% block content %}
<!-- space for the content of html files using this layout-->
{% endblock %}
</div>
</body>
</html>
prediction_output.html
As an example of the HTML files, the code for displaying the prediction outputs is shown
below. It uses the layout template and shows the table of predictions made. By means of
JavaScript, the production fails are highlighted in red in the table.
{% extends "layout.html" %}
{% block content %}
<div class="menu">
<h1>Prediction > Results</h1>
<!-- show time and date of prediction -->
<p>{{pred_to_print}}</p>
<!-- display table with prediction results -->
<p>{{ table|safe }}</p>
</div>
<script>
// make reference to the table object
var table = document.getElementById('result_table');
// go through table and highlight fails
for (var r = 0, n = table.rows.length; r < n; r++) {
for (var c = 0, m = table.rows[r].cells.length; c < m; c++) {
if(table.rows[r].cells[c].innerHTML == "Fail")
{
table.rows[r].style.backgroundColor = "red";
}
}
}
</script>
{% endblock %}
A.2.3. ML Model
The shown code follows the workflow from training and pipeline building to prediction-making
and monitoring. First, the model is trained in train.py, then a pipeline is created in
build_pipeline.py, which is then used to make and monitor predictions in the predict.py script.
train.py
In order to train the model, a data scientist defined the necessary steps as an input for the
deployment. Thus, a large part of the code can be taken from the modeling phase. In train.py, the
defined steps are executed one after another and objects are exported to be used later on.
# imports
import os
import pandas as pd
import numpy as np
from math import sqrt
import joblib
# imports for sklearn functionality
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
# imports for data balancing
from imblearn.over_sampling import SMOTE, SMOTENC
from imblearn.under_sampling import RandomUnderSampler
# ignore warnings
import warnings
warnings.simplefilter(action='ignore')
# import setting from configuration
import configuration
from configuration import OBJECT_FOLDER as objectfolder

def get_data():
    # load training data file
    filepath = configuration.TRAINING_DATA_FILE
    return pd.read_csv(filepath)

def adjust_columns(data):
    # drop duplicates
    data.drop_duplicates(inplace=True, subset=["Time"])
    # set time stamp as index
    data.set_index(keys=["Time"], inplace=True)
    # drop mostly empty columns
    mostly_empty_columns = data.columns[data.isnull().mean() > 0.5]
    data.drop(mostly_empty_columns, axis=1, inplace=True)
    # interpolate
    data.interpolate(inplace=True)
    data.fillna(method='bfill', inplace=True)
    # drop constant features
    isConstant = data.nunique() == 1
    constantColumns = data.columns[isConstant]
    data.drop(constantColumns, axis=1, inplace=True)
    # export objects
    joblib.dump(mostly_empty_columns, os.path.join(objectfolder, 'mostly_empty_columns.pkl'))
    joblib.dump(constantColumns, os.path.join(objectfolder, 'constant_columns.pkl'))
    return data

def scale_data(data):
    # divide set into numeric and target
    data_numeric = data[data.columns[data.columns != 'Pass/Fail']]
    data_target = data[data.columns[data.columns == 'Pass/Fail']]
    # create a scaler, train scaler and transform data
    scaler = StandardScaler()
    scaled = scaler.fit_transform(data_numeric)
    # create a new DataFrame with the standardized data and with the original labels
    data_scaled = pd.DataFrame(data=scaled, columns=data_numeric.columns)
    # put back the non numeric variable
    data_target.reset_index(inplace=True)
    data_scaled['Pass/Fail'] = data_target['Pass/Fail']
    # export trained scaler
    joblib.dump(scaler, os.path.join(objectfolder, 'scaler.pkl'))
    return data_scaled

def reduce_dimension(data):
    # get the numerical data
    data_numeric = data[data.dtypes[data.dtypes == 'float64'].index]
VII Appendix 101
# Execute PCA so that 95% of variance are explained
pca = PCA(.95, random_state=42)
principal_components = pca.fit_transform(data_numeric)
data_principal = pd.DataFrame(data = principal_components)
# save output of PCA as array
x = np.array(data_principal)
y = np.array(data['Pass/Fail'])
# features are selected via linear regression
estimator = LinearRegression()
rfe = RFE(estimator)
selector = rfe.fit(x, y)
# reduce data frame to only the selected variables
selected_features = data_principal.columns[selector.support_]
# reduce variables
data_principal_reduced = data_principal[selected_features]
data_principal_reduced["Pass/Fail"]=data["Pass/Fail"]
# save PCA and RFE as objects for later
joblib.dump(pca, os.path.join(objectfolder, 'pca.pkl'))
joblib.dump(rfe, os.path.join(objectfolder, 'rfe.pkl'))
return data_principal_reduced
def undersample_data(data):
# train test split
train, test = train_test_split(data, test_size = 0.3, random_state=42)
# separate data set into features and target
X = data.loc[:, data.columns != 'Pass/Fail']
y = data.loc[:, data.columns == 'Pass/Fail']
# take majority class and reduce instances, the minority class is not chan
ged
rus = RandomUnderSampler(sampling_strategy='majority', random_state=42)
# execute resampling
X_rus, y_rus = rus.fit_resample(X, y)
#j oining features and target to one dataframe
y_rus.columns = ['Pass/Fail']
train_undersampled = X_rus.join(y_rus)
train_undersampled = train_undersampled.sample(frac=1).reset_index(drop=Tr
ue)
VII Appendix 102
# select randomly and scramble rows
train_undersampled = train_undersampled.append(train.sample(frac=1)[0:500]
, sort=False)
train_undersampled = train_undersampled.sample(frac=1).reset_index(drop=Tr
ue)
return train_undersampled
def fit_classifer(train):
# divide train and test set into X and y each
X_train = np.array(train.loc[:,train.columns !='Pass/Fail'])
y_train = np.array(train.loc[:,train.columns =='Pass/Fail'])
# create algorithm and train it
classifier = RandomForestClassifier(n_estimators = 500, max_depth = 20, ra
ndom_state = 42)
classifier.fit(X_train, y_train.ravel())
# export trained model
joblib.dump(classifier, os.path.join(objectfolder, 'trained_model.pkl'))
def execute_training():
print("Training started.")
# preprocessing steps
data = reduce_dimension(scale_data(adjust_columns(get_data())))
# undersampling of train set
train = undersample_data(data)
# fit model
fit_classifer(train)
# objetcs are saved and exported within functions
print("Training finished.")
# main method
if __name__ == '__main__':
execute_training()
build_pipeline.py
Based on the objects exported from training, a pipeline is built which contains all data
preprocessing steps and the trained ML algorithm. This pipeline is exported as an object for
the next step.
# imports
import os
import joblib
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin
from configuration import OBJECT_FOLDER as objectfolder
# import auxiliary methods necessary for pipeline
from ML_model import preprocessors as pp
# load objects from training
mostly_empty_columns = joblib.load(filename=os.path.join(objectfolder, "mostly_empty_columns.pkl"))
constant_columns = joblib.load(filename=os.path.join(objectfolder, "constant_columns.pkl"))
scaler_imported = joblib.load(filename=os.path.join(objectfolder, "scaler.pkl"))
pca_imported = joblib.load(filename=os.path.join(objectfolder, "pca.pkl"))
rfe_imported = joblib.load(filename=os.path.join(objectfolder, "rfe.pkl"))
model_imported = joblib.load(filename=os.path.join(objectfolder, "trained_model.pkl"))

# define steps of pipeline
pipeline = Pipeline(
    [
        ('remove_mostly_empty_columns', pp.RemoveMostlyEmptyColumns(variables_to_drop=mostly_empty_columns)),
        ('interpolate_missing_values', pp.InterpolateMissingValues()),
        ('remove_constant_features', pp.RemoveConstantFeatures(variables_to_drop=constant_columns)),
        ('Standard_Scaler', scaler_imported),
        ('PCA', pca_imported),
        ('RFE', rfe_imported),
        ('Random_Forest', model_imported)
    ]
)
# export pipeline
def dump_pipeline():
    joblib.dump(pipeline, filename=os.path.join(objectfolder, "trained_pipeline.pkl"))
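The pipeline above relies on custom transformer classes imported from ML_model/preprocessors.py,
which are not reproduced in this listing. As an illustration only, the following minimal sketch
shows how one of these transformers could be written so that it fits the scikit-learn pipeline
interface and mirrors the corresponding step in adjust_columns from train.py; it is an
assumption, not the author's original implementation.
# hypothetical sketch of a transformer such as RemoveMostlyEmptyColumns
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class RemoveMostlyEmptyColumns(BaseEstimator, TransformerMixin):
    def __init__(self, variables_to_drop=None):
        # columns identified as mostly empty during training
        self.variables_to_drop = variables_to_drop

    def fit(self, X, y=None):
        # nothing is learned here, the columns to drop come from training
        return self

    def transform(self, X):
        # drop the stored columns from the incoming data frame
        return X.drop(columns=self.variables_to_drop)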
predict.py
For predictions, the trained pipeline is imported and a method is defined which returns
predictions based on a given data input. This method is called by app.py when the user
requests a prediction or evaluation metrics for monitoring purposes.
# imports
import os
import pandas as pd
import numpy as np
import joblib
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import configuration
from configuration import OBJECT_FOLDER as objectfolder

# import trained pipeline
trained_pipeline = joblib.load(os.path.join(objectfolder, "trained_pipeline.pkl"))

def get_prediction_df(input_data):
    # convert input data into data frame
    df = pd.DataFrame(input_data)
    # set time stamp as index
    df.set_index("Time", inplace=True)
    df = df.astype(np.float32)
    # save time stamps for traceability
    time_stamps = df.index.tolist()
    # create product IDs
    product_IDs = []
    for time_stamp in time_stamps:
        id_part1 = time_stamp[2:4]
        id_part2 = time_stamp[5:7]
        id_part3 = time_stamp[8:10]
        id_part4 = "{0:0=4d}".format(time_stamps.index(time_stamp))
        id_complete = str(id_part1) + str(id_part2) + str(id_part3) + "_" + str(id_part4)
        product_IDs.append(id_complete)
    # get prediction from pipeline
    predictions = trained_pipeline.predict(df)
    # save predictions with additional information
    df_results = pd.DataFrame(
        {'Time Stamp': time_stamps,
         'Prediction': predictions,
         'Product ID': product_IDs
         })
    # replace numerical values by human-readable ones
    df_results["Prediction"].replace(to_replace=-1, value="Pass", inplace=True)
    df_results["Prediction"].replace(to_replace=1, value="Fail", inplace=True)
    return df_results

def get_metrics_scores():
    # load holdout data set with target variable
    data = pd.read_csv(configuration.HOLDOUT_DATA_FILE)
    # set Time as index
    data = data.set_index("Time", inplace=False)
    # save correct labels
    y = data["Pass/Fail"]
    # drop label as the model requires unlabeled data
    X = data.drop('Pass/Fail', axis=1, inplace=False)
    # execute prediction
    y_pred = trained_pipeline.predict(X)
    # calculate accuracy, precision, recall and F1 score
    scores = np.array([accuracy_score(y, y_pred), precision_score(y, y_pred),
                       recall_score(y, y_pred), f1_score(y, y_pred)])
    # round values to 4 decimals
    scores_rounded = np.around(scores, decimals=4)
    # return rounded scores
    return scores_rounded

def get_version_number():
    # return current version of pipeline
    return configuration.VERSION_NUMBER
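The following lines only sketch how these methods could be called, for example from app.py or
a small test script; the file path and the import path are assumptions made for illustration.
# hypothetical usage sketch, not part of the original scripts
import pandas as pd
from ML_model import predict  # assuming predict.py is part of the ML_model package

# file name and location are assumptions for illustration
input_data = pd.read_csv("data/2008-10-15.csv")
# request predictions and print the first results
df_results = predict.get_prediction_df(input_data)
print(df_results.head())
print("Pipeline version:", predict.get_version_number())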
A.2.4. Data
With regard to data, all data is ingested in the form of CSV files. The example below shows
comma-separated values from the sensor data of one specific day, which are used as an input
for prediction-making. Each line contains the 590 sensor values for one instance of data.
2008-10-15.csv