PREDICTING MICROBIAL ACTIVITY FOR COMPOSTING
USING MACHINE LEARNING TECHNIQUES
by
REIMAN L. RABBANI
(Under the Direction of Khaled Rasheed)
ABSTRACT
In this thesis, several Machine Learning (ML) methods, along with newly developed hybrid
algorithms, are applied for the first time to predict microbial activity during composting.
Modeling biological activity in an inadequately understood domain is a difficult task. This
thesis evaluates, compares, and analyzes the resulting models and the methods used to build
them.
The results indicate with statistical significance that hybridizing an eager learner with a
lazy one improves learning performance in this domain. Lazy-eager hybrids can form complex,
irregular hypotheses: the expressive power of the eager learner is significantly enhanced by its
ability to represent the target function as a combination of several locally approximated
hypotheses. The study also showed hybrid rule-based methods and trees to be good performers.
INDEX WORDS: Composting, Microbial Activity, Machine Learning, Lazy Eager Learner,
Model Tree, Hybrid Learner, Biosolids, Municipal Solid Waste, MSW, Ethanol, Modeling
PREDICTING MICROBIAL ACTIVITY FOR COMPOSTING
USING MACHINE LEARNING TECHNIQUES
by
REIMAN L. RABBANI
B.S. in Computer Science, King College, 2002
A Thesis Submitted to the Graduate Faculty of The University of Georgia in Partial Fulfillment
of the Requirements for the Degree
MASTER OF SCIENCE
ATHENS, GEORGIA
2006
© 2006
Reiman L. Rabbani
All Rights Reserved
PREDICTING MICROBIAL ACTIVITY FOR COMPOSTING
USING MACHINE LEARNING TECHNIQUES
by
REIMAN L. RABBANI
Major Professor: Khaled Rasheed
Committee: Walter D. Potter
Ronald W. McClendon
Electronic Version Approved:
Maureen Grasso
Dean of the Graduate School
The University of Georgia
August 2006
DEDICATION
To my beloved parents and my two wonderful brothers.
ACKNOWLEDGEMENTS
I would like to express my deepest gratitude and respect to Dr. Rasheed for providing me
with guidance, ideas, encouragement and patience. I have thoroughly enjoyed his classes and
teaching style, and he has always been available for my many questions and regular visits to his
office. I took my first class and first research course at UGA with Dr. Potter, who
passionately and diligently introduced me to the realm of AI, where it was a pleasure to hunt for
snakes on many sleepless nights. Thank you for your sincerity in teaching and for having high
expectations from your students. I am very thankful to Dr. McClendon, who introduced me to
this project, and has been instrumental in guiding me to develop this thesis. Dr. Potter’s and Dr.
McClendon’s class on Computational Intelligence is probably the best two-for-one class package
at UGA and I am thankful to them for being on my committee. Many thanks to Dr. K. C. Das
from the Agricultural Engineering Department at UGA for his suggestions and help with the
composting experiment and data. Heartfelt thanks to Dr. Arabnia, without whom the UGA CS
program wouldn’t have been the same – thank you for believing in me. His charisma, positive
attitude, and genuine care for students have made so many of our UGA experiences possible. I
am appreciative of Dr. John Miller for his great teaching style and helpful attitude. I would also
like to thank Boseon Byeon who has helped me with parts of this research in a class project.
Thanks to the friendly and jovial staff of the CS department, especially Claudia, Jean and
Elizabeth, who were always eager to help with a smile. My warmest thanks go to my family -
you guys have always been there for me in every aspect of my life. Last but not least, it was a
pleasure to have made so many wonderful friends at UGA, thank you all.
TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS.............................................................................................................v
CHAPTER
1 INTRODUCTION .........................................................................................................1
1.1 MACHINE LEARNING...................................................................................2
1.2 MUNICIPAL SOLID WASTE, BIOSOLIDS, AND ETHANOL....................3
1.3 COMPOSTING AND ITS MODELING ..........................................................5
1.4 THESIS MOTIVATION...................................................................................6
1.5 RELATED WORK............................................................................................8
1.6 THESIS OBJECTIVES...................................................................................10
2 ANALYZING THE DATA .........................................................................................11
2.1 COMPOSTING EXPERIMENT ....................................................................11
2.2 DIFFICULTIES IN LEARNING....................................................................12
3 MACHINE LEARNING METHODOLOGIES ..........................................................19
3.1 BACKGROUND.............................................................................................19
3.2 EAGER AND LAZY LEARNERS.................................................................21
3.3 LINEAR REGRESSION.................................................................................23
3.4 k-NEAREST NEIGHBOR..............................................................................24
3.5 LOCALLY WEIGHTED LINEAR REGRESSION (LWR) ..........................26
3.6 RADIAL BASIS FUNCTION NETWORKS (RBF) ......................................27
3.7 SUPPORT VECTOR MACHINES (SVM) ....................................................28
3.8 REGRESSION TREES ...................................................................................29
3.9 MODEL TREES..............................................................................................31
3.10 ARTIFICIAL NEURAL NETWORKS (ANN) .............................................33
3.11 HYBRID METHODS ....................................................................................35
3.12 ADAPTIVE NEURO-FUZZY INFERENCING SYSTEM (ANFIS) ...........36
3.13 COMBINING MODELS - ENSEMBLE APPROACHES............................36
4 IMPLEMENTATION DETAILS..................................................................39
4.1 LAZY METHODS ..........................................................................................39
4.2 EAGER METHODS .......................................................................................40
4.3 HYBRID METHODS .....................................................................................43
4.4 HYBRID NEURO-FUZZY SYSTEM............................................................43
5 EVALUATION, RESULTS AND ANALYSIS..........................................................45
5.1 MODEL EVALUATION METRICS .............................................................46
5.2 MODEL DEVELOPMENT & EVALUATION .............................................47
5.3 EAGER LEARNING METHOD RESULTS..................................................51
5.4 LAZY LEARNING METHOD RESULTS ....................................................66
5.5 COMBINING MODELS - ENSEMBLE RESULTS......................................70
5.6 HYBRID METHOD RESULTS .....................................................................72
5.7 EVALUATING MACHINE LEARNING SCHEMES ..................................75
5.8 ANALYSIS .....................................................................................................84
6 CONCLUSION AND FUTURE WORK ....................................................................87
REFERENCES ..............................................................................................................................90
CHAPTER 1
INTRODUCTION
The widespread availability of computational resources at the present time has led to the
deployment of computers to help in every aspect of modern society. The next natural progression
of this technology would be to enable it to automatically learn ways to help us move towards our
goals. This is where Machine Learning (ML) comes into play. It is a relatively young field in
computer science that has had far-reaching practical and commercial implications. According to
Tom Mitchell (1997), ML is a broad multidisciplinary field drawing on concepts from Artificial
Intelligence (AI), probability, statistics, information theory, philosophy, biology, cognitive
science, computational complexity and many other disciplines. It entails the study of algorithms
or techniques that allow the computer to “learn”, i.e., automatically improve its performance
through experience and/or knowledge. Machine Learning has a wide spectrum of applications;
some very successful areas are search engines, bioinformatics, stock market analysis, 3D object,
speech, and handwriting recognition, game playing, and robot locomotion, to mention a few. ML
methods are well suited for poorly understood domains where humans lack
the knowledge needed to develop an algorithm (e.g., biological processes); for domains where
the program must dynamically adapt to changing conditions (e.g., adapting to the changing
interests of individuals); and for automatically finding valuable hidden patterns in large
databases (Mitchell 1997). This thesis concentrates on the application and subsequent
comparison of several ML methods and their hybrids for modeling biosolids composting, which
is an inadequately understood biological process.
1.1 MACHINE LEARNING
Machine Learning algorithms can be thought of as search algorithms that traverse the set
of all possible hypotheses H (or concepts to be learned) in order to find the unknown concept
underlying the training instances. The concept c to be learned by the machine is commonly
referred to as the target concept, which can be a skill, pattern, model, relation, function, rule, or
some other form of knowledge. In other words, ML focuses on inducing general
functions/patterns/rules from specific training examples. For example, a simple target concept
involving only two descriptors/attributes to identify a car could be, “If object has 4 wheels and
can move, then object is a car”, which could also be expressed as a 3-tuple, (numWheels = 4,
canMove = true, object = car). The most basic algorithm searches through the space of all
possible hypotheses H to find one that is consistent with the training data/knowledge presented.
A hypothesis h is consistent with a set of training examples D, denoted as Consistent(h, D), if
and only if h(x) = c(x) for each example <x, c(x)> in D, the set of training examples. If our goal
is to consider the number of wheels and the ability to move of an object to identify whether it is a
car, then the hypothesis space is quite small. In this case, given a few positive and negative
examples, it would take a simple algorithm, e.g. List-Then-Eliminate (Mitchell 1997) described
below, to learn c.
The List-Then-Eliminate Algorithm is as follows:
1. S = set of all hypotheses in H
2. For each training example, <x, c(x)>
Remove from S any hypothesis h for which h(x) ≠ c(x)
3. Output S, which now contains the list of consistent hypotheses.
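As a concrete illustration, the toy car concept and the List-Then-Eliminate algorithm above can be sketched in Python. The tiny hypothesis space and attribute encoding below are hypothetical, chosen only for illustration:

```python
# Illustrative sketch of List-Then-Eliminate (Mitchell 1997) for the toy
# car-identification concept from the text.
from itertools import product

# Each hypothesis maps (numWheels, canMove) -> True/False ("is a car").
# Enumerate a small hypothesis space: every rule of the form
# "object is a car iff numWheels == w and canMove == m".
WHEEL_VALUES = [2, 3, 4]
MOVE_VALUES = [True, False]

def make_hypothesis(w, m):
    return lambda num_wheels, can_move: num_wheels == w and can_move == m

hypotheses = {(w, m): make_hypothesis(w, m)
              for w, m in product(WHEEL_VALUES, MOVE_VALUES)}

def consistent(h, examples):
    """Consistent(h, D): h(x) == c(x) for every <x, c(x)> in D."""
    return all(h(*x) == label for x, label in examples)

def list_then_eliminate(hypotheses, examples):
    # 1. S = set of all hypotheses in H
    s = dict(hypotheses)
    # 2. Remove every h inconsistent with some training example
    for x, label in examples:
        s = {k: h for k, h in s.items() if h(*x) == label}
    # 3. Output S, the surviving (consistent) hypotheses
    return s

# Training data: <x, c(x)> pairs for the target concept
# "4 wheels and can move => car".
examples = [((4, True), True),
            ((4, False), False),
            ((2, True), False)]

surviving = list_then_eliminate(hypotheses, examples)
print(sorted(surviving))  # → [(4, True)]
```

After three examples, only the hypothesis matching the target concept survives; with a larger hypothesis space, more examples would be needed to narrow S down.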
The above sample algorithm illustrates learning to classify discrete objects; however,
other algorithms can be designed to learn real-valued or discrete functions, patterns, rules, etc. In
this thesis, learners are grouped by their lazy or eager learning characteristics. Lazy learners
delay the processing of training examples until they must label a new query instance, creating
local models on the fly around each query, while eager learners build one or more global
approximations from the training data and must then use that fixed model to classify every new
instance. There are pros and cons to both types of learners, which are further discussed in
Chapter 3 of this thesis.
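To make the lazy/eager distinction concrete, the following sketch contrasts a lazy k-nearest-neighbor predictor with an eager least-squares line on a made-up one-dimensional dataset; the data values are illustrative, not from the composting experiment:

```python
# Lazy vs. eager learning on a toy 1-D regression dataset (illustrative data).
train = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]

# Eager: fit a global line y = a*x + b once, at training time.
n = len(train)
sx = sum(x for x, _ in train)
sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train)
sxy = sum(x * y for x, y in train)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

def eager_predict(q):
    return a * q + b          # the same global model answers every query

# Lazy: defer all work to query time; average the k nearest targets.
def lazy_predict(q, k=2):
    nearest = sorted(train, key=lambda p: abs(p[0] - q))[:k]
    return sum(y for _, y in nearest) / k   # a local model built per query

print(round(eager_predict(2.5), 2))  # → 2.54
print(round(lazy_predict(2.5), 2))   # → 2.55
```

On this near-linear toy data the two agree closely, but on irregular target functions the lazy learner's per-query local models can follow structure that a single global line cannot.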
A branch of Artificial Intelligence, Fuzzy Logic (FL) is an extension of Boolean logic
that describes partial truths, allowing it to model the uncertainties of inadequately understood
domains (Engelbrecht 2002). It generates human-readable rules in simple linguistic terms and
can take into account vagueness, uncertainty, and partial descriptions. Although FL systems have
enjoyed widespread application, they require domain experts to initially define the linguistic
terms. In this thesis, FL is used in a neural network hybrid method developed by Jang (1993)
called ANFIS, which is further discussed in Chapter 3.
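A minimal sketch of the partial-truth idea behind FL, using a triangular membership function. The linguistic terms and breakpoints here are hypothetical; in practice a domain expert would define them, or a neuro-fuzzy system such as ANFIS would tune them from data:

```python
# Graded (partial) membership instead of Boolean truth (illustrative terms).
def triangular(x, left, peak, right):
    """Triangular membership function on [left, right], peaking at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical linguistic terms for a "moisture" variable (percent).
def moisture_low(m):  return triangular(m, 20, 30, 50)
def moisture_high(m): return triangular(m, 40, 60, 80)

m = 45.0
# Partial truth: 45% moisture is somewhat "low" AND somewhat "high",
# rather than strictly one or the other as in Boolean logic.
print(moisture_low(m), moisture_high(m))  # → 0.25 0.25
```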
1.2 MUNICIPAL SOLID WASTE, BIOSOLIDS, AND ETHANOL
Municipal solid waste (MSW), more commonly known as garbage or trash, is an
unpleasant but inevitable byproduct of modern society. The Environmental Protection Agency
(EPA 2003) estimated an average of 4.5 pounds of waste produced per person per day, which
amounts to about 236 million tons of total waste generated per year. Non-hazardous
biodegradable organic material constitutes almost half of the total waste generated in the US;
other constituents are shown in Figure 1.1.
Figure 1.1: Composition of the 229.2 million tons of MSW generated in the US in 2001 (EPA 2003).
Another byproduct of modern society is biosolids: the solid, dehydrated organic
matter produced by wastewater treatment plants, with an estimated annual production of 7.5
million tons for 2005 (EPA 1999). While landfills, ocean/river dumping, and combustion are
the traditional methods of handling such a high volume of MSW and biosolids, they are merely
temporary solutions with highly adverse long- and short-term effects on the environment.
Composting is the safest and most environmentally friendly method of disposing of MSW and
biosolids. Composting MSW and biosolids go hand in hand because biosolids are best
composted in combination with various organic components of MSW such as yard trimmings
(e.g., leaves, grass clippings, brush), paper, and chipped wood debris (EPA 1999), which are also
called bulking agents.
Ethanol, or ethyl alcohol, has been dubbed the fuel of the future by the President of the
United States in his 2006 State of the Union address (Bush 2006). It can be produced in two
ways: from petrochemicals by the hydration of ethylene, or biologically by the fermentation of
various sugars using microbes. Sugars can be found in carbohydrates from agricultural crops, or
in inexpensive and abundant waste sources such as crop residues, grass, wood, and possibly
MSW. However, extracting the carbohydrate content from crop residues and MSW is
complicated, cost-inefficient, and underdeveloped both commercially and academically.
Composting is the suggested method for breaking down and freeing the sugar content of crop
residues and, more ambitiously, MSW (Gray 1999). The reader is referred to Roehr (2001)
for details on the current and future state of the ethanol production process.
1.3 COMPOSTING AND ITS MODELING
Composting is the controlled natural biodegradation of organic waste matter into more
stable and naturally useful organic substances. Biodegradation is a biological process in which
various types of microorganisms perform the aerobic decomposition of organic materials.
Composting is a viable high-turnout solution for waste reduction in wide use across the US and
the world. A high percent of MSW can be composted into stable and less toxic material, which
reduces the load on landfills, river/ocean dumping, and combustion. This in turn decreases
pollution of water, air and soil, and thereby conserves the natural resources by processing these
organic wastes into a soil-building substance. As mentioned in Section 1.2, composting can also
play a vital role for the production of ethanol from cellulose containing crop residues, wood
chips, grass and stalks.
The dynamics of composting can be compared to the widely known simple cellular
automaton called the “Game of Life” - it would be a more complex version of the game with
various colonies of multiple species of microbes in a continually changing eco-system. It is a
perpetual survival game involving several types of microbes thriving at different stages of the
compost, affected by various factors such as time, temperature, nitrogen, carbon/nitrogen ratio,
substrate/bulking agent, pH, moisture, oxygen level (aeration), particle size, type of microbes,
and nutrient balance. The physical and chemical environment
surrounding the microbes is constantly changing, primarily as a result of consumption of oxygen
by respiration, increase in temperature, crowding, and accumulation of metabolic products and
byproducts (Liang et al. 2003a). This makes it difficult to create a unified model. In order to
predict composting a flexible yet complex high-dimensional modeling mechanism is required,
this is where machine learning can be applicable.
While composting depends on various chemical and physical factors, according to Miller
(1992), the most important parameters for the microorganisms are temperature, moisture,
oxygen, pH, and substrate/bulking agent composition. Research conducted by others agrees on
these parameters but places different emphases on them. It is generally agreed, however, that the
most dominant factors for microbial activity are moisture, temperature, and oxygen (aeration),
all of which should be varied with time (McCartney 1998; Liang et al. 2003b; Rosso 1993) to
achieve maximal microbial activity. These are the main attributes considered in this research.
1.4 THESIS MOTIVATION
In spite of a 350% increase in the number of composting facilities in the US in the last 15
years (Goldstein and Gray 1999), composting accounts for less than 30% of MSW and 21% of
biosolids recovered into nature. Other means of managing MSW, such as incineration/
combustion, landfills, and river/ocean dumping, waste natural resources and have detrimental
long- and short-term effects on the environment. According to a 2001 EPA report (EPA 2003) on MSW, it
is imperative to invest in composting, recycling and source-reduction practices to sustain the
economic growth of the nation and society. However, most of the commercial composting
processes in operation today are in primitive stages; there is limited understanding of the
biological, chemical, and physical interactions during composting (Liang et al. 2003a; EPA
1995), processing efficiency is low, and costs are relatively high. Compost prices range from
$26 per ton for landscape mulch to more than $100 per ton for high-grade compost that is
bagged and sold at the retail level (EPA 2006a).
Ethanol, considered the environmentally friendly renewable fuel source of the future, is in
its early stages of development in the United States. Brazil has created an infrastructure that uses
ethanol produced from cane sugar to supply almost 40% of the country’s automobile fuel needs.
It is apparent that ethanol is a feasible solution to the energy crisis in the US. It is only a matter
of time before cost-effective composting methods are developed to produce it from abundant
crop residues, waste crops, and MSW.
While there has been research into developing mathematical models (deterministic,
stochastic etc.), empirical rule-based models and mechanistic models, these models do not work
well for accurate process control. ML techniques are attractive for modeling such complex
biological processes due to their ability to dynamically modify their behavior, store experimental
knowledge, and make that knowledge available for modeling. ML methods can also point out the
importance of specific descriptors and implicit relations among them that may prompt further
investigation. This thesis compares and studies several ML methods and their hybrids applied to
the composting domain and shows favorable modeling results in hopes of broadening the
understanding of composting models, facilitating composting applications and improving process
control.
1.5 RELATED WORK
Modeling composting is related to biodegradation modeling. Many sophisticated models
have been developed for commercial and academic purposes to predict certain material’s
biodegradability. EnviroSim offers a popular commercial package, BIOWIN for the prediction of
the rate of aerobic microbial biodegradation of various mixed substances using linear and non-
linear models of regression and a comprehensive data store. Baker et al. (2004) presents an
overview of various Machine Learning methods that have been applied to model biodegradation,
namely Artificial Neural Networks, partial least squares discriminant analysis, inductive rule-
and knowledge-based learning systems, and Bayesian analysis. Although these software
packages and models can predict the rate of biodegradation of some substances, they have not
been used in tandem for composting process control.
Several researchers have modeled aerobic composting since the early 1990s using
deterministic, stochastic, mechanistic, or steady-state models. Physically based process models,
introduced by Haug (1993) and further studied by Das et al. (1998) using non-linear
mathematical functions, have been used by industrial and municipal scale bioconversion centers.
While the former are based solely on physical environmental factors, Stombaugh and Nokes
(1996) created a simple model of physical-microbiological parameters in the reactor using
differential equations that considered microbial, substrate, and oxygen concentrations, moisture,
temperature, and vessel size. Mechanistic approaches incorporating the non-linearity of
microbial degradation have also been developed (Liang et al. 2003a, b; Xi et al. 2005) that consider the
biological component of composting and kinetics at a particle level. These models have the
practical benefit of being able to optimize certain operational parameters, but in general they are
too simple and often are not able to model the dynamics involved. There are also some software
simulators for basic composting modeling that do not employ Machine Learning methodologies,
e.g. STELLA, ORWARE (2002) etc.
Machine Learning methods have been successfully applied to a broad range of biological
process modeling over the past decade, causing the application of the discipline to run far ahead
of its theory. Liang et al. (2003a) applied ANNs to predict biosolids composting from a pilot
scale experiment and evaluated the model against a different experiment, achieving promising
results. Following their earlier research, the composting problem was treated as a black box
dependent on several environmental variables. A primary drawback of ANNs is that they do not
yield human-readable rules that could be used by other systems, and the models they create are
almost impossible to interpret. Furthermore, the Backpropagation algorithm employed by most
ANNs does not produce stable models, i.e., different training runs using the same dataset will
produce different models. Morris (2005) used an evolutionary method for Fuzzy learning called
FISSION in order to model composting and achieved results similar to Liang et al. (2003a).
This thesis studies and compares various other ML methods, such as Support
Vector Machines, Radial Basis Function Networks, Instance Based Learning, Model Trees, and
Regression Trees. It also explores ensemble methods and hybrid ML methods, namely a
Neuro-Fuzzy System and lazy-eager learners such as Lazy SVM and Lazy Model Trees. Others
have used hybrid schemes in the past, e.g., boosted lazy decision trees for classification (Brodley
2003) and hybrid schemes to reduce computation and memory load (Zhou et al. 1999). The
hybrid algorithms developed here were able to improve upon the results of published research.
1.6 THESIS OBJECTIVES
This project seeks to apply several machine learning algorithms and develop suitable
hybrid algorithms to model and predict composting dynamically with the intention of real-world
applicability. Results are analyzed and statistical tests are performed to understand the
comparative pros and cons of each method. This research is performed on the same composting
dataset so that it can be compared to published results (Liang et al. 2003; Morris 2005).
The modeling goal is to predict the oxygen uptake rate, which is an indicator of
microbial activity and is directly proportional to the rate of composting. The most important
descriptors for predicting composting are moisture content, temperature, and time. The data is
from the research of Liang et al. (2003b).
Specifically, the objectives of this thesis are to:
1) Analyze the data from the pilot experiment and pre-process it for use with various
ML schemes.
2) Study and explore ML algorithms and their hybrids suitable for this domain.
3) Develop and implement proposed ML algorithms as required.
4) Evaluate and compare the predictive accuracy of the ML models developed on the
basis of their ability to predict unseen data from another composting experiment.
5) Study the performance, robustness, and applicability of the ML schemes using
statistical significance tests and uncertainties in the data.
CHAPTER 2
ANALYZING THE DATA
The advantage of applying ML methods is that minimal domain knowledge is required
and they can discover relations/patterns among descriptors. To learn the composting function,
we can draw on the traditional empirical methods currently employed by engineers and
scientists to design full scale composting reactors. According to Das et al. (1997), a few steps
are ideally performed in designing a full scale composting model for a previously untested
substance: (1) substance characterization, (2) pilot scale process evaluation to provide essential
data, (3) product testing, (4) design parameter evaluation, and (5) scale-up of the system. The
data gathered from the pilot scale experiment can be used to train the ML system, which can
then be tested on a separate composting run.
2.1 COMPOSTING EXPERIMENT
The composting data used in this experiment was provided by the University of
Georgia’s Biological and Agricultural Engineering Department. Research conducted by Liang et
al. (2003a, b) on biosolids mixed with pine sawdust and water provided an excellent source of
composting data. Six batches of composting experiments under six discrete temperature settings
(22°C, 29°C, 36°C, 43°C, 50°C, and 57°C) were run over a period of eight months using the same
material. Each batch simultaneously incubated two replicates of 5 composting chambers under
the same temperature but 5 different moisture settings (30%, 40%, 50%, 60%, and 70%),
providing 146 observations per replicate over a 10-day period. This resulted in 6T × 5M × 2R ×
146 readings, yielding a total of 8760 training instances. To provide for an independent
evaluation of the model, separate experiments were also conducted under a different temperature,
34°C, which was not present in the training data. Figure 2.1 shows the behavior of microbial
activity with time over a period of 240 hours at 36°C temperature and 70% moisture setting.
Figure 2.1: Plot of microbial activity over 240 hours at 36°C temperature and 70% moisture content. (Plot: oxygen uptake rate, an indicator of microbial activity, vs. time in hours.)
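The experimental design described in this section can be sanity-checked with a few lines of Python; note that the sixth temperature setting, 57°C, is inferred from the discussion and figures later in this chapter:

```python
# Sanity check of the experimental design: 6 temperature settings ×
# 5 moisture settings × 2 replicates × 146 readings per replicate.
temperatures = [22, 29, 36, 43, 50, 57]   # °C (57°C inferred from Section 2.2)
moistures = [30, 40, 50, 60, 70]          # percent moisture content
replicates = 2
readings_per_replicate = 146

total = (len(temperatures) * len(moistures)
         * replicates * readings_per_replicate)
print(total)  # → 8760, matching the total number of training instances
```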
2.2 DIFFICULTIES IN LEARNING
There are no pre-defined rules that govern the methodology of collecting data for ML
algorithms; each algorithm has its strengths and weaknesses in dealing with various aspects of a
dataset. In the remainder of this chapter, we hypothesize about some issues that may contribute
to the difficulty of modeling composting with ML methods.
2.2a LIMITED DISCRETE VALUES IN DESCRIPTORS
Each training example has one of 5 discrete values for moisture, one of 6 discrete values
for temperature and one of 146 discrete values for the time elapsed. These limited discrete
attribute values in the training data may pose a complication to the learning algorithm. Because
there are complex non-linear relationships among the descriptors, it is not merely a matter of
interpolating the values in between. In other words, there are several empty regions of the input
space with no indication of the nature of the hypercurve for those regions (e.g., there are 1752
instances for each of the moisture settings of 40% and 50%, but none for 45%).
2.2b POTENTIAL DESCRIPTOR INADEQUACY
When two experiments are conducted under the same settings, one would expect similar
composting behavior, but in many cases the behaviors are significantly different. Due to the
biological nature of composting, two instances with the same attribute
values are likely to have different target class values. For example, two instances with
descriptors having identical values of M40%, T22°C, and time of 150 hrs, have substantially
different values for oxygen uptake rate: 0.5038 and 0.0452 mg/gH. This poses a problem for the
learner, as it introduces a degree of inconsistency into the problem domain. This is not an isolated
occurrence with some instances but represents a pervasive feature of this biological process. This
discrepancy is shown in the graphs of Figures 2.2 through 2.10; ideally the graphs should be the
same. Notice that the graphs of the 30% (M30%) and 40% (M40%) moisture settings usually
exhibit dramatically different behaviors; referring to Figure 2.6, we observe that for one graph
(M40% R2) composting does not start in the specified time frame, while in the other graph
(M40% R1), composting takes a long time to initialize. It should also be noted that, for the
M30% runs, composting ultimately fails to occur in both replicates at several temperatures
(22°C, 29°C, 36°C, and 57°C). Therefore, high discrepancy in composting behavior is shown at
certain settings. This may suggest the need for additional descriptors that are not considered in
the experiment.
Previous research has shown that moisture is possibly the most important factor in
composting (Miller 1992; Liang et al. 2003b) and our data analysis and models imply the same.
The three-dimensional plot of microbial activity against moisture and temperature shown in
Figure 5.43 also indicates that moisture has more effect than temperature. The algorithm that
generates rule sets from Model Trees (Holmes et al. 1999), M5Rules, was also able to learn this
observation as shown by the rules inferred by the algorithm; Figure 5.18 shows some sample
rules.
2.2c UNCERTAINTY IN TARGET VALUES
Due to the extremely low microbial activity readings at the 30% and 40% moisture settings
(see Figures 2.2, 2.3, and 2.6), there is a possibility of uncertainty in the target values
introduced by the measuring instruments. Two inevitable aspects of experimentation are: 1)
noise introduced by sensory devices, and 2) difficulty in maintaining environmental settings.
The Oxygen sensor had a sensitivity level of 0.0065 mg(O2)/g Hr, and a histogram shows that
2919 of the 8760 training examples had target values in the range of 0 to 0.065 mg/g Hr of
Oxygen. Although the oxygen sensor was a high precision device, it is likely that uncertainty had
been introduced into the target values. The target class values had a mean of 0.445 mg/gH and a
standard deviation of 0.391 for the training data (8760 instances). Combined with the effect of
unknown descriptors involved in composting and the difficulty of controlling experimental
settings, it is clear that a high level of uncertainty is involved.
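The kind of summary statistics quoted above can be computed as follows. The data here is a synthetic, illustrative stand-in (a truncated Gaussian), not the actual sensor readings, so its exact numbers will differ from those reported in the text:

```python
# Mean, standard deviation, and count of near-noise readings, as used above
# to gauge uncertainty in the target values. Synthetic stand-in data; the
# thesis reports mean 0.445 and std 0.391 mg/gH over 8760 instances, with
# 2919 readings in the range 0 to 0.065 mg/g Hr.
import random
import statistics

random.seed(0)
# Hypothetical stand-in for the 8760 oxygen-uptake readings.
readings = [max(0.0, random.gauss(0.445, 0.391)) for _ in range(8760)]

mean = statistics.mean(readings)
std = statistics.pstdev(readings)
near_noise = sum(1 for r in readings if r <= 0.065)  # near sensor sensitivity

print(round(mean, 3), round(std, 3), near_noise)
```

A third of the real readings falling at or below ten times the sensor's sensitivity is what makes instrument uncertainty a plausible contributor to the modeling difficulty.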
Figure 2.2: Discrepancy in microbial activity between two identical experiments (replicates) at 22°C and 30% moisture. (Plot: oxygen uptake rate vs. time in hours; series M30% R1 and M30% R2.)
Figure 2.3: Discrepancy in microbial activity between two identical experiments (replicates) at 22°C temperature and 40% moisture. (Plot: oxygen uptake rate vs. time in hours; series M40% R1 and M40% R2.)
Figure 2.4: Discrepancy in microbial activity between two identical experiments (replicates) at 22°C temperature and 50% moisture. [Plot of oxygen uptake rate against time for replicates R1 and R2.]
Figure 2.5: Discrepancy in microbial activity between two identical experiments (replicates) at 22°C temperature and 70% moisture. [Plot of oxygen uptake rate against time for replicates R1 and R2.]
Figure 2.6: Discrepancy in microbial activity between two identical experiments (replicates) at 29°C temperature and 40% moisture. [Plot of oxygen uptake rate against time for replicates R1 and R2.]
Figure 2.7: Discrepancy in microbial activity between two identical experiments (replicates) at 43°C temperature and 30% moisture. [Plot of oxygen uptake rate against time for replicates R1 and R2.]
Figure 2.8: Discrepancy in microbial activity between two identical experiments (replicates) at 43°C temperature and 40% moisture. [Plot of oxygen uptake rate against time for replicates R1 and R2.]
Figure 2.9: Discrepancy in microbial activity between two identical experiments (replicates) at 57°C temperature and 40% moisture. [Plot of oxygen uptake rate against time for replicates R1 and R2.]
Figure 2.10: Discrepancy in microbial activity between two identical experiments (replicates) at 57°C temperature and 50% moisture. [Plot of oxygen uptake rate against time for replicates R1 and R2.]
These observations can help us decide on the ML algorithm, provide guidance for
parameter tuning, and suggest any necessary data pre-processing. In summary, the data contains certain inconsistencies, noise, possibly insufficient descriptors for the domain, and a narrow range of discrete values (6 for temperature, 5 for moisture, 146 for time) to provide adequate coverage of the distribution of the real-valued attributes. Approaches to minimize these effects during learning include: 1) taking the mean of the output class values of the two replicates, 2) using more attributes, 3) using multiple models to predict, and 4) pruning noisy exemplars, i.e. already-seen instances, used for classification, that have been identified as problematic or outliers (Aha 1992; Witten 2005).
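The first of these approaches, averaging the target values of the two replicates, can be sketched as follows. This is only an illustration of the idea: the readings and sampling times below are made up, not taken from the dataset.

```python
def average_replicates(rep1, rep2):
    """Average the target (oxygen uptake) readings of two replicate runs.

    rep1, rep2: lists of (time, o2_rate) pairs sampled at the same times.
    Returns a list of (time, mean_o2_rate) pairs.
    """
    if len(rep1) != len(rep2):
        raise ValueError("replicates must have the same number of readings")
    averaged = []
    for (t1, y1), (t2, y2) in zip(rep1, rep2):
        if t1 != t2:
            raise ValueError("replicates must be sampled at the same times")
        averaged.append((t1, (y1 + y2) / 2.0))
    return averaged

# Hypothetical readings for two replicates (cf. the discrepancies in Figure 2.3)
r1 = [(0, 0.10), (60, 0.30), (120, 0.50)]
r2 = [(0, 0.12), (60, 0.26), (120, 0.44)]
print(average_replicates(r1, r2))
```

Averaging in this way smooths out replicate-to-replicate noise at the cost of discarding the information that the two runs disagreed.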
CHAPTER 3
MACHINE LEARNING METHODOLOGIES
3.1 BACKGROUND
Machine Learning algorithms can be thought of as search algorithms that traverse the set
of all possible hypotheses H (or concepts to be learned) in order to find a hypothesis h that can
be used to describe the training instances. Let us refer back to Section 1.1 to continue the definition and description of the simple List-Then-Eliminate algorithm. We have correctly
assumed that given the right examples, this learner will successfully learn our intended target
concept - (numWheels = 4, canMove = true, object = car). This type of learner is able to classify
unknown instances due to an inherent bias that guides them during the search of H, and this
methodology is referred to as Inductive Learning (Mitchell 1997). The set of assumptions made
in an algorithm to guide its search is called the Inductive Bias; for this algorithm it is the
assumption that the target concept exists in the hypothesis space (c ∈ H). Occam’s Razor is a
very famous inductive bias employed by ML algorithms like ID3 for Decision Trees (Quinlan
1986), which states that simpler hypotheses should be preferred over complex ones.
The List-then-Eliminate algorithm is not pragmatic as the computational complexity in
exhaustively searching the hypothesis space (which may be incomplete) is of the order:
O(n · ∏_{i=1}^{m} |a_i|) = O(n · |H|) .........................................................................(3.1)
where ai = ith attribute of an instance, |ai| = cardinality of discrete valued ai attribute, n = number
of instances, m = number of attributes, and |H| is size of the hypothesis space. A problem arises
when a target concept is too complex to be represented by the learner’s expressive power. This
can happen if the List-Then-Eliminate algorithm encounters an instance like, (numWheels = 4,
canMove = true, object = shoppingCart). In the case that it is a query instance, the learner will misclassify it as a car; conversely, if it is a training instance, the learner will fail to learn the concept. A solution to this situation would be to enrich the concept space; for example by adding
more descriptors, instead of merely using two attributes to classify a car. Another solution to
enriching the concept space can be by using a more expressive representation, for example, by
using disjunctions as well as conjunctions in the representation, we can express any finite
discrete-valued function (Mitchell 1997). Another predicament appears when the training instances presented to the learner contain erroneous information, for example, (numWheels = 0, canMove = true, object = car). Such an instance would either 1) lead to the learning of a false concept, or 2) cause the algorithm to fail to converge upon a concept (and not learn anything) due to conflicting information. Therefore, the number of descriptors, expressiveness of the
learner, erroneous data, and other factors have to be taken into account when considering and
developing Machine Learning algorithms.
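The exhaustive-search cost of Equation (3.1) can be made concrete with a few lines of code. The helper below is a sketch of that bound, not code from the thesis; the second example uses the discrete attribute cardinalities noted in Chapter 2 for this dataset (6 temperature, 5 moisture, and 146 time settings).

```python
from functools import reduce

def exhaustive_search_cost(n_instances, attribute_cardinalities):
    """Cost bound of Equation (3.1): n * |a_1| * |a_2| * ... * |a_m| = n * |H|."""
    hypothesis_space = reduce(lambda acc, card: acc * card, attribute_cardinalities, 1)
    return n_instances * hypothesis_space

# Toy car concept: numWheels in {0..4} (5 values), canMove in {true, false}
print(exhaustive_search_cost(100, [5, 2]))          # 100 * 10 = 1000

# The composting data: 8760 instances, attribute cardinalities 6, 5, and 146
print(exhaustive_search_cost(8760, [6, 5, 146]))    # 8760 * 4380 = 38368800
```

Even for this modest discretization, the bound runs into the tens of millions, which is why the later sections turn to learners with stronger inductive biases.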
The hypothesis space H explodes when the target class and/or attributes are not discrete but rather real numbers; referring back to Equation (3.1), |a_i| = ∞, thereby making |H| = ∞. The simple List-Then-Eliminate algorithm will not work, since it is not computationally feasible to traverse an infinite H. It is then no longer a classification problem but rather a regression or
function approximation problem. Fortunately, there are ML methods for real-valued prediction
or modeling problems often based on principles drawn from classification methods and other
fields like biology, statistics, mathematics etc. Although techniques from other disciplines like
regression analysis, multivariate spline regression, extrapolation, interpolation, etc. are available
to approximate the target function, ML methods have proven to be very successful and efficient.
The next sub-sections provide an overview of the ML methods used in this thesis, which are,
Artificial Neural Networks (ANN), Support Vector Regression, Radial Basis Function Networks,
Rule-inferring systems, Model and Regression Trees, and Instance Based Learning methods.
3.2 EAGER AND LAZY LEARNERS
Eager learners are those algorithms that use the training instances to create an a posteriori
model which is later used to classify query instances. These learners explicitly commit to a single
hypothesis, which is a global approximation of the target function that covers the entire instance
space. They cannot subsequently alter their global hypothesis until retrained. In most domains
where the concepts do not change frequently (e.g. games, object recognition, locomotion etc.)
eager learners are highly suitable and successful. However, composting behavior radically changes based on various factors that cannot be taken into consideration – a model created with a certain substance under a specific temperature is not likely to be applicable to a different substance and/or temperature.
Instance-Based Learners, also known as lazy learners, do not explicitly commit to a global approximation; instead they create a hypothesis when given a query instance. This increases their expressive ability by creating many local approximations to the target function, consequently making them more flexible by being able to modify parts of their hypothesis on the fly. These properties of lazy learners make them enticing for this domain; as a result we have considered the k-Nearest Neighbor (k-NN) and Locally Weighted Regression (LWR) algorithms. They do not perform any computation initially with the training instances; they merely store the instances for use during classification, which is when most of the computation is performed. Using a distance function (e.g. Euclidean, Hamming, or Levenshtein distance) the nearest neighbors to the query instance are selected, and these are then used to construct a hypothesis. Applying the locally created hypothesis, it then performs classification/prediction.
To illustrate the benefit of lazy learners, let us consider a hypothesis space consisting of
linear functions, the eager learner will learn a single linear function to cover the entire training
instance space and future instances. Conversely, although a lazy learner may be using linear
regression, it effectively uses a richer hypothesis space because it uses many different local
linear functions to form its implicit global hypothesis of the target concept.
It is computationally expensive to compute the distance for every single instance during classification, which is of the order O(mn), where n is the number of instances and m is the number of attributes. However, efficient data structures have been developed to quickly retrieve nearby instances based on the distance function, e.g. kd-trees (Bentley 1975; Witten & Frank 2005), which is further discussed in Section 4.1.
The following is a list of the terminology used in the subsequent sections:
n – the total number of instances.
m – the number of attributes in an instance.
x_i – the ith instance, which can be thought of as a vector of m dimensions.
a_r^(i) – the rth attribute of the instance x_i.
w_j – the weight corresponding to the jth attribute.
x^(i) – the actual output value of the target class for the ith instance.
f(x) – the target function that is to be learned.
f̂(x) – the learned function, which is an approximation of the target function.
f̂(x_i) – the predicted output value of the target class for the ith instance.
3.3 LINEAR REGRESSION
Linear Regression is one of the simplest function approximation techniques that can be
used when all the attributes are numeric and the concept to be learned is fairly simple –
specifically if there is a linear relationship between the attributes and the target class. It can be
used to classify or predict and is presented here as a building block for other algorithms
described in this section. If the instance vector x_i from the instance space is ⟨a_1^(i), a_2^(i), a_3^(i), …, a_k^(i)⟩, then the learned function f̂(x), which is an approximation of the target function f(x), after regression is of the form:

f̂(x) = w_0·a_0^(1) + w_1·a_1^(1) + … + w_k·a_k^(1) = Σ_{j=0}^{k} w_j·a_j^(1) ....................................................(3.2)

with the convention a_0^(1) = 1 for the bias term.
The goal in this learning is to choose weights, w0, w1 ,…, wk for the above equation that will
minimize the sum squared error E given below (this is also the inductive bias that guides the
search through the space of all possible weight values):
E = Σ_{i=1}^{n} ( x^(i) − Σ_{j=0}^{k} w_j·a_j^(i) )² ....................................................................(3.3)

Using Equation (3.2) we can rewrite Equation (3.3) as,

E = Σ_{i=1}^{n} ( x^(i) − f̂(x_i) )² .......................................................................................(3.4)
where x^(i) is the actual output class value of the ith instance, and n is the total number of instances. LWR approximates the target function using k neighbors; therefore in Equation (3.4), n is replaced by the k instances. The gradient descent algorithm can be used to find the weights of Equation (3.2). A shortcoming of this method is that it can only find hypotheses of linear expressiveness; if the target concept is nonlinear then regression will yield poor results.
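The procedure can be sketched as a minimal batch gradient-descent routine that chooses the weights of Equation (3.2) by stepping against the gradient of the sum-squared error E of Equation (3.3). This is an illustration, not code from the thesis; the data is a made-up noise-free linear target.

```python
def predict(w, a):
    """Equation (3.2): f_hat(x) = w0 + w1*a1 + ... + wk*ak (bias weight w0 first)."""
    return w[0] + sum(wj * aj for wj, aj in zip(w[1:], a))

def fit_linear(instances, targets, lr=0.01, epochs=2000):
    """Choose weights minimising the sum-squared error E by batch gradient descent."""
    k = len(instances[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for a, y in zip(instances, targets):
            err = predict(w, a) - y
            grad[0] += 2.0 * err                 # dE/dw0
            for j in range(k):
                grad[j + 1] += 2.0 * err * a[j]  # dE/dwj
        for j in range(k + 1):
            w[j] -= lr * grad[j]
    return w

# Noise-free data drawn from f(x) = 1 + 2*a1: the weights should be recovered.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
w = fit_linear(xs, ys)
print(w)
```

On nonlinear data the same routine still converges, but only to the best linear fit, which is exactly the expressiveness limitation noted above.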
3.4 k-NEAREST NEIGHBOR
This is a popular and thoroughly studied method that has been useful for a wide range of
practical problems in spite of its simplicity. The lazy learner, the k-nearest neighbor algorithm (k-NN), finds the k nearest neighbors of the query instance using a distance function (e.g. Euclidean, Hamming, or Levenshtein distance), then another function is used to return a discrete or real output. The Euclidean distance function in an m-dimensional space (where m is the number of attributes) is described in Equation (3.5). Using the notation described at the beginning of this chapter, the Euclidean distance between two instance vectors x_i and x_j is:

Euclidean Distance = d(x_i, x_j) = √( Σ_{r=1}^{m} ( a_r^(i) − a_r^(j) )² ) ..................................................(3.5)
The value of k is usually determined empirically or through cross-validation techniques.
Once the neighbors are selected, there are several algorithms to compute the output value; in the following, k is the number of neighbors, x_q is the query instance, and x_1 … x_k are the k instances nearest to x_q:

1) For a discrete function, f: ℝ^n → V:

f̂(x_q) = argmax_{v ∈ V} Σ_{i=1}^{k} δ(v, f(x_i)) ........................................................(3.6)

where δ(a, b) = 1 if a = b, else it is equal to 0. A majority voting scheme can also be used, which makes the algorithm similar to the Naïve Bayes method.

2) For a real function, f: ℝ^n → ℝ:

f̂(x_q) = ( Σ_{i=1}^{k} f(x_i) ) / k ......................................................................(3.7)
3) Weighting the contribution of each neighbor based on distance:

The weight of the ith neighboring instance can be calculated as w_i = 1 / d(x_q, x_i)².

Using w_i in Equations (3.6) and (3.7) we get,

f̂(x_q) = argmax_{v ∈ V} Σ_{i=1}^{k} w_i·δ(v, f(x_i))   and   f̂(x_q) = ( Σ_{i=1}^{k} w_i·f(x_i) ) / ( Σ_{i=1}^{k} w_i )

respectively.
In spite of the simple yet effective nature of this algorithm, it is susceptible to pitfalls – the curse of dimensionality caused by too many irrelevant attributes can confuse the algorithm, query classification can take a long time, causing slow performance on large training sets, and it does not create explicit generalizations about the data that may be easily inspected. Aha et al. (1991) developed techniques to overcome some shortcomings of the k-NN algorithm: 1) noisy exemplar pruning to better handle noisy instances, 2) attribute weighting to minimize the effect of irrelevant attributes, and 3) a framework for inferring the hypothesis learned.
Naturally one may ask: if k-NN does not perform any training-time computation to build a model, what hypothesis has it learned? It builds one using the k neighbors during query time to
classify; this has the implicit effect of splitting the instance space into many distinct regions. One
way to view the hypothesis of the k-NN algorithm is by extracting rules that bound each region;
in this manner rule-sets can be inferred. It can be shown that when k and n, the number of
instances, both become infinite in such a way that k/n → 0, the probability of error approaches
the theoretical minimum for the dataset (Witten & Frank 2005). In that case k-NN approaches
the Naïve Bayes Classifier, as it takes a weighted vote among all members of the instance space.
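The distance-weighted variant of Equations (3.5)–(3.7) can be sketched in a few lines. This is an illustration, not the implementation used in this thesis; the toy training targets simply follow y = a1 + a2.

```python
import math

def euclidean(xi, xj):
    """Equation (3.5): distance over the m attribute values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def knn_predict(train, xq, k):
    """Distance-weighted k-NN prediction of a real-valued target.

    train: list of (attribute_vector, target_value) pairs.
    """
    neighbours = sorted(train, key=lambda inst: euclidean(inst[0], xq))[:k]
    num, den = 0.0, 0.0
    for x, y in neighbours:
        d = euclidean(x, xq)
        if d == 0.0:
            return y                 # exact match: return its target directly
        w = 1.0 / d ** 2             # w_i = 1 / d(x_q, x_i)^2
        num += w * y
        den += w
    return num / den

# Toy training set where the target is a1 + a2.
data = [([0.0, 0.0], 0.0), ([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 2.0)]
print(knn_predict(data, [0.9, 0.9], k=3))
```

Note how the prediction for [0.9, 0.9] is pulled strongly toward the nearby instance ([1.0, 1.0], 2.0), exactly the effect of the 1/d² weighting.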
3.5 LOCALLY WEIGHTED LINEAR REGRESSION (LWR)
The concepts of Linear Regression and k-NN are applied in LWR, which attempts to
approximate a real valued target function by selecting some adjacent points in the hyperspace
and fitting them with a local approximation (e.g. a line using Linear Regression). These regional
piecewise approximations can then be combined to successfully approximate the entire target
function. The local approximations are made using a linear, quadratic, Gaussian function etc.
which are called kernels. In order to use LWR effectively, it is necessary to decide upon distance
and weighing functions. Similar to k-NN, Euclidean distance can be used and there are several
choices for the weighing functions, e.g. linear, Epnechinikov, Tricube, Gaussian or simple
constant (Atkeson 1997). Figure 3.1 illustrates a linear piecewise approximation to the neighbors
of a query point x in 2-dimensional space. The choice of kernel can affect the weight given to
each neighbor, as well as the number of neighbors selected. Gaussian kernels are portrayed
below.
Figure 3.1: The nearest neighbors to the query point x are used to fit a linear equation. The number of neighbors considered and the weight given to them can be determined by a kernel function (the shaded area under the curve above).
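A minimal one-attribute LWR sketch follows: a Gaussian weighting function assigns each training point a weight based on its distance to the query, and a weighted least-squares line is fitted in closed form. The data and bandwidth are invented for illustration.

```python
import math

def gaussian_weight(d, bandwidth=1.0):
    """Gaussian kernel: nearby points get weight near 1, distant points near 0."""
    return math.exp(-(d / bandwidth) ** 2)

def lwr_predict(xs, ys, xq, bandwidth=1.0):
    """Locally weighted linear regression for one numeric attribute.

    Fits w0 + w1*x by weighted least squares around the query point xq
    (closed-form solution of the weighted normal equations).
    """
    w = [gaussian_weight(abs(x - xq), bandwidth) for x in xs]
    sw = sum(w)
    sx = sum(wi * x for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, ys))
    sxx = sum(wi * x * x for wi, x in zip(w, xs))
    sxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    denom = sw * sxx - sx * sx
    if abs(denom) < 1e-12:           # degenerate spread: fall back to weighted mean
        return sy / sw
    w1 = (sw * sxy - sx * sy) / denom
    w0 = (sy - w1 * sx) / sw
    return w0 + w1 * xq

# Points from y = x^2: one global line fits poorly, but local lines track the curve.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [x * x for x in xs]
print(lwr_predict(xs, ys, 1.5, bandwidth=0.5))
```

With a narrow bandwidth the prediction at x = 1.5 stays close to the true value 2.25, even though the model fitted at each query is only a line.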
3.6 RADIAL BASIS FUNCTION NETWORKS (RBF)
RBF networks represent an interesting blend of instance-based and neural network
learning algorithms (Broomhead and Lowe, 1998). In this approach, the learned hypothesis is a
function that is a linear combination of m functions K_u[d(x_u, x)], often called basis or kernel functions or centers (Mitchell 1997). It is of the following form:

f̂(x) = w_0 + Σ_{u=1}^{m} w_u·K_u[d(x_u, x)] ..............................................................................(3.8)

The ability of RBF networks stems from the freedom to choose different values for the weights and kernel functions. The basis function K_u is defined so that it decreases as the distance d(x_u, x) increases. Although f̂(x) is a global approximation to the target function, each of the m basis functions only contributes to a region near the point x_u in the instance space. Like LWR, this has the effect of forming a smooth piece-by-piece fit of the target function, as illustrated in Figure 3.2. Usually the basis functions are Gaussian functions centered at a point x_u with some variance σ_u², although many others exist. The weights are usually found using the global error criterion defined in Equation (3.4); other measures also exist.
RBF networks give every attribute the same weight since all are treated equally in the
distance computation. This renders them ineffective in dealing with irrelevant attributes, unlike
multi-layer perceptrons which can automatically handle this. When the training example set is
too large and/or the target function is too complex, more centers (basis functions) should be used
and/or strategically placed, otherwise the RBF network may not form a good approximation as
illustrated in Figure 3.2. To alleviate this situation clustering algorithms can be applied to the
instances to suitably place basis functions in the instance space. RBF networks require less
computational resources to train compared to ANN’s feedforward backpropagation algorithm.
Figure 3.2: Illustration of fitting training points using an RBF network with 10 Gaussian kernels (centers or basis functions). The width of the Gaussian kernels is the standard deviation (0.5 in this case). Observing the approximated graph, we notice that 10 kernels are not enough for a good approximation of this many instances (100). Created with the RBF Java Applet (Hong 1996).
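Equation (3.8) can be sketched with Gaussian basis functions and least-squares weights; the normal equations are solved here with a tiny Gaussian elimination. The centres, width, and data are invented for illustration, and the example fits samples drawn from a known one-centre model so the weights can be recovered exactly.

```python
import math

def gaussian_basis(x, centre, sigma):
    """A basis function K_u that decreases as the distance to its centre grows."""
    return math.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))

def design_row(x, centres, sigma):
    return [1.0] + [gaussian_basis(x, c, sigma) for c in centres]  # leading 1 -> w0

def solve(A, b):
    """Gaussian elimination with partial pivoting (enough for tiny systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def rbf_fit(xs, ys, centres, sigma):
    """Least-squares weights for f_hat(x) = w0 + sum_u w_u * K_u(d(x_u, x))."""
    rows = [design_row(x, centres, sigma) for x in xs]
    p = len(rows[0])
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Atb = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(p)]
    return solve(AtA, Atb)

def rbf_predict(w, x, centres, sigma):
    return sum(wi * phi for wi, phi in zip(w, design_row(x, centres, sigma)))

# Recover a known one-centre model, y = 0.5 + 2*K(d(1.0, x)), from samples.
centres, sigma = [1.0], 0.7
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [0.5 + 2.0 * gaussian_basis(x, 1.0, sigma) for x in xs]
w = rbf_fit(xs, ys, centres, sigma)
print(w)
```

In a realistic setting the centres would come from a clustering step, as described above, rather than being fixed by hand.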
3.7 SUPPORT VECTOR MACHINES (SVM)
SVM originated from research in statistical learning theory (Vapnik 1999). SVM uses
linear models to implement nonlinear classification boundaries by transforming the input space
into a higher dimensional space using a non-linear mapping. Using principles of computational
learning theory, in simplified terms we can state that, a hyperplane in the new higher
dimensional space is likely to be a complex curvy hyperplane in the lower dimensional space,
illustrated in Figure 3.3. This principle is exploited by SVM to classify complex instance spaces.
SVM principles can also be extended to approximate real-valued target functions, an approach called Support Vector Regression (SVR); both terms are used synonymously in this thesis.
SVM are also called kernel machines, as they can use different kernel functions to compute the distance between two instance vectors. Due to the increased number of calculations involved in the kernel computations, SVM are inherently slow; however, a very effective algorithm, Sequential Minimal Optimization (SMO), developed by Platt (1998) and Smola (1998), can be used to aid the implementation as it runs in O(n²). SVM makes use of the maximal margin hyperplane
and a regression method using the ε-tube, which strongly prevents it from overfitting the data. This creates an excessive smoothing effect, as shown in the graph of Figure 5.12.
Figure 3.3: SVM principle illustrated, using dimensionality transformation,
from 2D to 3D to find a hyper-plane that separates the instances.
Recently SVM have been stealing much of the limelight enjoyed by ANN, probably due to the fact that they require much less parameter tuning, thereby making them easier to use. SVM share the same problem as RBF networks in dealing with irrelevant attributes. In fact, support
vector machines with Gaussian kernels (RBF kernels) are a particular type of RBF network, in
which basis functions are centered on certain instances, and the outputs are combined linearly by
computing the maximum margin hyperplane (Witten and Frank 2005).
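The dimensionality transformation of Figure 3.3 can be sketched directly. The toy points below are made up, and the mapping used is the quadratic-term feature map associated with a degree-2 polynomial kernel; this illustrates only the transformation idea, not SVM training itself.

```python
import math

def to_feature_space(x1, x2):
    """Feature map of the degree-2 polynomial kernel (quadratic terms only):
    (x1, x2) -> (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return (x1 * x1, x2 * x2, math.sqrt(2.0) * x1 * x2)

def separable_by_plane(inner, outer):
    """In the mapped 3-D space, the plane z1 + z2 = c (a circle x1^2 + x2^2 = c
    in the original 2-D space) separates the classes iff every mapped inner
    point lies below every mapped outer point along z1 + z2."""
    inner_vals = [sum(to_feature_space(x1, x2)[:2]) for x1, x2 in inner]
    outer_vals = [sum(to_feature_space(x1, x2)[:2]) for x1, x2 in outer]
    return max(inner_vals) < min(outer_vals)

# A ring of one class around a cluster of the other: no straight line in 2-D
# separates them, but a plane in the transformed space does.
inner = [(0.1, 0.0), (-0.2, 0.1), (0.0, -0.3), (0.2, 0.2)]
outer = [(1.0, 0.0), (0.0, -1.1), (-0.8, 0.8), (0.9, -0.7)]
print(separable_by_plane(inner, outer))  # True
```

The separating plane in the new space corresponds to a circle in the original space, which is exactly the "complex curvy hyperplane" described above.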
3.8 REGRESSION TREES
Regression trees, based on decision trees and proposed by Breiman (1984), are able to
approximate real-valued functions. Decision tree learning is a powerful eager ML method
capable of learning any finite discrete-valued function by searching a powerfully expressive
hypothesis space which is complete. They learn disjunctive expressions and are robust to noisy
data, missing values, and errors. The regression tree building algorithm generates a decision tree
in which the leaves are marked with real values and the nodes with tests for real/discrete
attributes, as illustrated in Figure 3.4. Because the tree is greedily grown as training instances are encountered, it may over-fit the data, consequently resulting in poor performance on unseen examples. To overcome this problem, pruning methods use validation
sets and cost-complexity measures. In spite of pruning, regression trees often yield trees that are
cumbersome and difficult to interpret. In contrast to a single global regression equation, this
method implicitly builds many regression functions to approximate the target function. The
decisions made at each internal node can be thought of as dividing the instance space and
discovering patterns in the data. These properties make it perform much better than simple
regression and allow it to handle non-linear relationships.
Figure 3.4: An example of a regression tree predicting real values.
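A regression tree of this kind is, in effect, a set of nested attribute tests with a constant prediction at each leaf. The sketch below mimics the shape of such a tree for this domain; all thresholds and leaf values are invented for illustration and are not taken from Figure 3.4 or from any learned model.

```python
def regression_tree_predict(temperature, moisture):
    """A hand-written stand-in for a learned regression tree: each internal
    node tests an attribute, each leaf holds a constant predicted value.
    All thresholds and leaf values are hypothetical."""
    if moisture <= 45.0:
        if temperature <= 35.0:
            return 0.05      # leaf: dry and cool -> low activity
        return 0.15          # leaf: dry and hot
    if temperature <= 50.0:
        return 0.80          # leaf: moist, moderate temperature
    return 0.45              # leaf: moist but very hot

print(regression_tree_predict(temperature=22, moisture=50))  # prints 0.8
```

Each path from root to leaf bounds a rectangular region of the instance space, which is why a tree can capture non-linear relationships that a single regression line cannot.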
3.9 MODEL TREES
Composting facilities in operation today use empirical and rule-based models that have been created from pilot experimental runs (Das et al. 1998). This implies that algorithms producing rules would be highly applicable to this domain. Hence, we consider Regression and Model Trees and a variant algorithm which uses trees to find rules, called M5Rules (Frank & Witten 1998). Model trees are decision trees with linear regression equations at the leaves
instead of terminal output values as shown by Figure 3.5. Thus they offer a more sophisticated
approach for predicting numeric values and are appropriate when the attributes are also numeric.
The M5 algorithm proposed by J. R. Quinlan (1992) for constructing model trees is described
below.
Figure 3.5: An example of a model tree (subsection) for composting
showing 7 linear regressions at each leaf node; LM44-LM50.
Model trees are constructed by a divide-and-conquer approach similar to decision trees.
The set of training examples, T is either associated with a leaf, or some test is chosen that splits
T into subsets based on the test outcome and the same is applied recursively to the subsets. The
first step in building the tree is computing the standard deviation of the target values of the
training examples, which is used as a measure of the error at that node. Then the expected error
reduction is calculated as a result of testing an attribute at that node. Using a greedy approach,
the attribute which maximizes the expected error reduction is chosen for splitting at that node.
The expected error reduction, or Standard Deviation Reduction (SDR), is calculated by,

SDR = sd(T) − Σ_i ( |T_i| / |T| ) × sd(T_i) ..............................................................................(3.9)

where T_1, T_2, …, T_n are the sets produced by splitting the node according to the chosen attribute. The splitting process terminates when the target values of the instances that reach a
node vary only slightly, or when just a few instances reach the node.
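Equation (3.9) can be sketched as follows; the target values are invented, and sd is taken as the population standard deviation.

```python
import math

def sd(values):
    """Population standard deviation of the target values at a node."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sdr(parent, subsets):
    """Standard Deviation Reduction of Equation (3.9) for a candidate split."""
    n = len(parent)
    return sd(parent) - sum(len(t) / n * sd(t) for t in subsets)

# A split that groups similar targets together gives a large reduction ...
targets = [0.1, 0.1, 0.9, 0.9]
good = sdr(targets, [[0.1, 0.1], [0.9, 0.9]])
# ... while a split that mixes them gives essentially none.
bad = sdr(targets, [[0.1, 0.9], [0.1, 0.9]])
print(good, bad)
```

The greedy tree builder simply evaluates this quantity for every candidate attribute test at a node and picks the test with the largest reduction.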
When a model tree is used to predict the target value for an unseen instance, the tree is
followed down to a leaf as in a decision tree, using the instance’s attribute values to make routing
decisions at each node. The leaf contains a linear model based on some of the attribute values,
and this is evaluated for the test instance to yield a predicted value. However, instead of using this value directly, a smoothing process is used to compensate for the sharp discontinuities that will inevitably occur between adjacent linear models at the leaves of the pruned tree, particularly for models constructed from a smaller number of training examples.
Smoothing is done by producing linear equations for the internal nodes, as well as the leaf nodes, during tree-building time. A linear model is created at each node, using
standard regression techniques considering only the attributes that are tested in the sub-tree
below this node. Once the leaf model is used to obtain the raw predicted value for a test instance, that value is filtered along the path back to the root, being combined at each node with the value predicted by that node's linear model. The smoothing calculation used to combine the values at each node is (Frank et al. 2005):

p′ = (n·p + k·q) / (n + k)

where p′ is the prediction passed up to the next higher node, p is the prediction passed to this node from below, q is the value predicted by the model at this node, n is the number of training instances that reach the node below, and k is a smoothing constant. Empirical results show that smoothing substantially increases the accuracy of predictions.
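The smoothing step can be sketched as follows. The path, predictions, instance counts, and the choice k = 15 are all invented for illustration (the text above does not fix a value for the smoothing constant).

```python
def smooth(p, q, n, k=15.0):
    """One smoothing step: p' = (n*p + k*q) / (n + k)."""
    return (n * p + k * q) / (n + k)

def smoothed_prediction(leaf_value, path):
    """Filter a leaf's raw prediction up to the root.

    path: (q, n) pairs from the node just above the leaf up to the root,
    where q is that node's own linear-model prediction and n is the number
    of training instances that reached the node below it.
    """
    p = leaf_value
    for q, n in path:
        p = smooth(p, q, n)
    return p

# Hypothetical model tree: the leaf predicts 0.30 from 4 training instances,
# the internal node above predicts 0.50, and the root predicts 0.45.
print(smoothed_prediction(0.30, [(0.50, 4), (0.45, 20)]))
```

Because n is small near the leaf, the leaf's raw value is pulled noticeably toward its parent's prediction, which is exactly how smoothing tempers leaves built from few examples.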
3.10 ARTIFICIAL NEURAL NETWORKS (ANN)
An ANN is based on biological neurons and has been highly successful in capturing
complex non-linear relationships among many variables. There is much literature on ANN and it
has been successfully applied to many fields, including our domain of study using the same
dataset. Previously, ANN gave good results in modeling composting (Liang et al. 2003b), and hence it is considered in this thesis for comparative purposes. A brief overview of ANN is presented in this sub-section, but the reader is referred to the publication by Liang et al. (2003a) for a more thorough exposition on the usage of ANN in this domain.
An artificial neuron (AN) also called a perceptron, is basically a function of all the
weighted inputs that come into it. It uses an activation function based on the net input signal
(usually a summation or a product) and a bias to output a real value. There are several types of
activation functions which usually have the property of being a monotonically increasing
mapping, e.g. the popular Sigmoid function. The goal of a single perceptron is to learn the
weights of the various inputs that come into it.
Mimicking the structure of the brain, we can create an interconnected network of
perceptrons which is known as an artificial neural network. The weights are learnt subject to
minimizing an error criterion, which is usually the sum of the squared errors of all the instances.
If we think of the hypersurface produced by the error as a function of all the weights, then our
goal is to find the lowest point on that surface. Therefore the learning process is to find the
weight vector that minimizes the error. The possible number of weights is infinite, so the hypothesis space is quite vast; however, it can be proved that a single perceptron can learn any linear function using the perceptron training rule, provided the learning rate is sufficiently small (Minsky & Papert 1969; Mitchell 1997).
It has been proved that Feedforward Neural Networks (FFNN) with monotonically
increasing differentiable functions can approximate any continuous function with one hidden
layer provided that the hidden layer has enough hidden neurons (Hornik 1989). However, due to the nature of the Backpropagation algorithm, training time is usually slow and can take up to several thousand iterations. Furthermore, the computational complexity of training an exponential number of neurons is impractical; therefore, often we must consider the ratio of
approximation quality to the time required. A major concern in using ANN is that it is a black
box approach – there are no human understandable indicators to show what the network has
learnt. Also it is not possible to encode prior knowledge into an ANN. Although they might
perform very well with the training data and in some test cases, there is always the chance it may
fail for some unknown cases. There are many advantages to using this powerful ML tool but in
order to make a wise decision about the algorithm a comparative table is presented below in
Table 3.1.
Table 3.1: Pros and Cons of an Artificial Neural Network

Advantages:
- Very fast query performance (classification).
- Robust to noise.
- Robust to missing values.
- Can handle irrelevant attributes.
- Can handle a large number of attributes.
- Can deal with sensory data that is not human readable.

Disadvantages:
- Training time is considerably slow.
- Creates a black box model with no human readability; cannot be ported.
- Cannot assimilate existing domain knowledge.
- Different training runs using the same settings and data will create different models due to the nature of the Backpropagation algorithm.
- Creates an excessive smoothing effect (refer to Figure 5.4 and Section 5.3 for details).
- Many parameters need to be set, requiring strong familiarity with ANN; the structure of the network is an important consideration.
- Cannot modify the model once it is trained.
3.11 HYBRID METHODS
The lazy versions of eager learning methods are designed to improve performance and prediction accuracy (Zhou et al. 1999). The k training instances nearest to a testing instance are selected, and these k neighbor instances are then used to build a model at testing time. The k-nearest versions of model trees and support vector machine regression were developed for this research and coded in the Java programming language. Although model trees and support vector machine regression are eager methods, here the models are created at testing time. Because these learning methods are lazy, model-building time is non-existent, but testing time may be longer than that of the original eager versions. These methods can sometimes avoid over-fitting because they use only the k nearest neighbors to build models. These methods are also more robust to training noise (Mitchell 1997). One of the key advantages of these methods is their flexibility to change the model created during query time. In an environment like composting this has proven to be successful.
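The lazy-wrapping idea can be sketched as follows. Here the eager learner trained on the neighbours is plain one-attribute linear regression standing in for a model tree or SVR, and the data and choice of k are invented; the thesis's Java implementations are not reproduced here.

```python
def lazy_linear_predict(train, xq, k=3):
    """Lazy-eager hybrid: pick the k training instances nearest the query,
    fit an eager model (here, ordinary least squares on one attribute) to
    just those neighbours, and evaluate it at the query point.

    train: list of (x, y) pairs with a single numeric attribute x.
    """
    neighbours = sorted(train, key=lambda inst: abs(inst[0] - xq))[:k]
    xs = [x for x, _ in neighbours]
    ys = [y for _, y in neighbours]
    n = float(len(xs))
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:                      # all neighbours share one x value
        return my
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    w0 = my - w1 * mx
    return w0 + w1 * xq                 # the local model is discarded afterwards

# Piecewise behaviour a single global line would miss: flat, then rising.
data = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 2.0), (4.0, 3.0), (5.0, 4.0)]
print(lazy_linear_predict(data, 0.5))   # fits the flat region
print(lazy_linear_predict(data, 4.5))   # fits the rising region
```

The two queries land on different local models, illustrating how the hybrid's implicit global hypothesis is stitched together from many local approximations.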
3.12 ADAPTIVE NEURO-FUZZY INFERENCING SYSTEM (ANFIS)
This is a true hybrid machine learning method with a Fuzzy Inferencing System (FIS) at
the core and an adaptive ANN to learn the parameters of the FIS. This method has the advantage
of using minimal domain knowledge and producing human readable rules and has been
developed by Jang (1993). Other ways of learning the parameters also exist, e.g. Morris (2005)
used an Evolutionary Algorithm to evolve the membership functions. Traditionally, a FIS requires some prior knowledge, usually in the form of human domain expertise, to create linguistic variables, which may not always be available. ANN is capable of handling complex
relationships among many variables, and FIS can model vagueness, produce human-readable
rules and is transparent. Thus, the neuro-fuzzy method combines the best of both paradigms
where minimal prior knowledge is required and the powerful learning ability of ANN can be
used to create human readable rules/knowledge. An implementation of Jang’s algorithm is
available from Matlab’s package called ANFIS, which was used in this research. ANFIS
exhibited excellent performance using 216 rules to describe the composting process and
improved upon the results of previous ANN models.
3.13 COMBINING MODELS - ENSEMBLE APPROACHES
The idea behind ensemble approaches is straightforward: they try to make predictions
more reliable by combining the output of several different models. The most prominent methods
are bagging, boosting, and stacking (Witten & Frank 2005). The remainder of this subsection
describes the ensemble approaches that were used, followed by a discussion of their usability.
Bagging uses the same learning scheme to create multiple models from different
samplings of the training data. The models are trained on bags sampled randomly, with
replacement, from the training data, and their predictions are then combined by a weighted
average. This may appear redundant, but it allows several hypotheses to be developed
concurrently, and those hypotheses often turn out to be quite different from each other due to the
stochastic nature of ML. This is useful when a single model cannot converge on a hypothesis that
describes most of the training data, or when the search algorithm might be stuck on a local
optimum. Viewed as a search problem, bagging lets several methods rake the hypothesis space
simultaneously; with multiple searches it is unlikely that two methods will take the same search
path and create the same model.
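The bootstrap-and-combine loop described above can be sketched in a few lines. The base learner below, which simply predicts its bag's mean, is a deliberately trivial stand-in (a Model Tree or k-NN would be used in practice), and the class name is an illustrative assumption.

```java
import java.util.Random;

// Minimal sketch of bagging for regression: draw bootstrap samples (with
// replacement) from the training targets, build one model per sample, and
// combine the models' predictions by averaging. The base learner here is a
// trivial "predict the bag mean" stub standing in for a real scheme.
public class Bagging {
    public static double baggedPredict(double[] ys, int numBags, long seed) {
        Random rnd = new Random(seed);
        double sum = 0;
        for (int b = 0; b < numBags; b++) {
            // Bootstrap sample: n draws with replacement from the training data.
            double bagSum = 0;
            for (int i = 0; i < ys.length; i++) {
                bagSum += ys[rnd.nextInt(ys.length)];
            }
            sum += bagSum / ys.length; // this bag's model: the bag mean
        }
        return sum / numBags; // combine the models by averaging
    }
}
```

Each bag sees a slightly different sample, so each "model" differs; averaging them reduces the variance of the combined prediction.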
Boosting can be thought of as an enhanced version of bagging: the same machine
learning algorithm is used to create multiple models, but in a way that makes them complement
each other (Witten & Frank 2005). Unlike bagging, boosting does not build the models
independently; rather, each new model learns from the performance of previously built models.
It encourages new models to become experts for training instances misclassified by preceding
models, which it does by assigning weight values to instances. Additive regression, based on
boosting principles, allows boosting to be applied to regression models rather than classification.
For a detailed explanation of this method, please refer to Witten & Frank (2005).
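The residual-fitting idea behind additive regression can be sketched as follows. The base model here is just a constant (the mean residual), shrunk by a learning rate; that stub and the class name are illustrative assumptions, not the WEKA implementation.

```java
// Minimal sketch of additive regression (boosting for numeric prediction):
// each new model is fit to the residuals left by the sum of the previous
// models, so later models specialize on what earlier ones got wrong. The
// base learner is a constant (the mean residual) scaled by a shrinkage rate.
public class AdditiveRegression {
    // Returns the boosted prediction (identical for all instances with this stub).
    public static double boost(double[] ys, int iterations, double shrinkage) {
        double[] residual = ys.clone();
        double prediction = 0;
        for (int m = 0; m < iterations; m++) {
            // Base model for this round: the mean of the current residuals.
            double mean = 0;
            for (double r : residual) mean += r;
            mean /= residual.length;
            double step = shrinkage * mean;
            prediction += step;
            // Update the residuals so the next model targets what remains.
            for (int i = 0; i < residual.length; i++) residual[i] -= step;
        }
        return prediction;
    }
}
```

With shrinkage below 1, the prediction approaches the target mean gradually over many rounds, which is the mechanism that lets later models "complement" earlier ones.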
Stacking, unlike bagging, combines different learning algorithms. It is less popular than
the former two because it is difficult to analyze and lacks generally accepted best practices; there
are many variations of the basic idea (Witten & Frank 2005). It uses a metalearner (level-1
learner) to replace the voting and weighted-average mechanisms used in bagging. The
metalearner learns a functional relationship between the outputs of the base models, also called
the level-0 models. Because multiple learners can be combined in a sophisticated manner,
stacking can take the best from each algorithm. However, it is very slow, and each learner has to
be adjusted individually, requiring specific understanding. Furthermore, the wide choice of
level-0 learners also presents a problem: there is no rule for selecting them, other than
performing several empirical runs using combinations of learners based on user expertise.
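The level-0/level-1 data flow can be sketched as below. The two stub level-0 models and the fixed linear metalearner weights are illustrative assumptions; in a real stacker the level-1 weights are themselves learned, typically by regressing the true targets on held-out level-0 predictions.

```java
// Minimal sketch of stacking: level-0 models each make a prediction, and a
// level-1 metalearner combines those predictions. Here the metalearner is a
// fixed linear combination, standing in for a learned level-1 model.
public class Stacking {
    public interface Model { double predict(double x); }

    public static double stackedPredict(Model[] level0, double[] weights, double x) {
        // Level-1 model: weighted combination of the level-0 outputs.
        double out = 0;
        for (int i = 0; i < level0.length; i++) {
            out += weights[i] * level0[i].predict(x);
        }
        return out;
    }
}
```

Replacing the fixed weights with a trained linear regression over level-0 outputs recovers the standard stacking setup described by Witten & Frank (2005).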
In most cases combining multiple models improves predictive performance, although the
reverse can also happen; improvements are not guaranteed, and theoretical discussions and
degenerate cases exist that show failure. A major disadvantage of these ensemble methods is that
they are difficult to analyze, especially with incomprehensible models (Witten & Frank 2005).
When they perform well, it is not easy to understand which factors contribute to the improved
decisions. Some researchers doubt whether aggregate methods really have anything to offer over
traditional ones when the latter are properly used (Seeger 2001). Situations may arise where the
increase in accuracy is not worth the additional computational cost of training and classifying.
CHAPTER 4
IMPLEMENTATION DETAILS
4.1 LAZY METHODS
The lazy learning algorithms used in this thesis, k-NN and Locally Weighted Regression,
were implementations from the WEKA package (WEKA 2006; Witten & Frank 2005). WEKA,
which stands for Waikato Environment for Knowledge Analysis, is an open source software
package for data mining, written in Java. It is powerful, flexible, and expandable, allowing
machine learning schemes written in Java to be incorporated into the package. It has a graphical
user interface (GUI), a command line interface, scripting capability, and a sophisticated
full-featured API, which is one of the primary reasons for using WEKA in this thesis.
The best implementations of k-NN use a kd-tree (Bentley 1975; Witten & Frank 2005)
to store the training instances, which provides O(log n) time for insertion, deletion, and query,
and O(n log n) time for creation of the tree. For k-NN, values of k (the number of neighbors to
use) were varied in certain intervals from 3 to 50 to find the best number. Cross validation can
also be used to find the best value of k, although it is computationally expensive. An inverse
distance weighting function was used to penalize distant neighbors. Due to the small number of
attributes in this dataset, we did not have to apply techniques to deal with irrelevant ones. The
ranges of values of the attributes (moisture: 30%-70%, temperature: 22°C-57°C) were also
comparable, so no normalization techniques were used. We did not put an upper limit on the
window size for training instances stored in memory.
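The inverse-distance weighting used for k-NN can be sketched as below. The class name and the small epsilon guard against division by zero are illustrative assumptions; a 1-D distance stands in for the full Euclidean distance over the attributes.

```java
import java.util.Arrays;
import java.util.Comparator;

// Sketch of k-NN prediction with inverse-distance weighting: nearer neighbors
// receive proportionally larger weights (weight = 1/distance), penalizing far
// neighbors. A tiny constant guards against division by zero when a neighbor
// coincides exactly with the query point.
public class WeightedKnn {
    public static double predict(double[] xs, double[] ys, double query, int k) {
        Integer[] idx = new Integer[xs.length];
        for (int i = 0; i < xs.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> Math.abs(xs[i] - query)));
        double num = 0, den = 0;
        for (int j = 0; j < k; j++) {
            double d = Math.abs(xs[idx[j]] - query);
            double w = 1.0 / (d + 1e-9); // inverse-distance weight
            num += w * ys[idx[j]];
            den += w;
        }
        return num / den; // weighted mean of the k nearest targets
    }
}
```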
The LWR implementation in WEKA is based on Atkeson's (1997) algorithm. The k
nearest neighbors are found based on Euclidean distance, using values of k that performed best in
cross validation (between 20 and 30). A Gaussian weighting kernel is then applied to the k
neighbors, whose values are combined in a decision stump to generate the output. Several runs
are often required to obtain the best parameters; Java programs were written to automate this
process (screenshot shown in Figure 4.3).
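The Gaussian weighting step can be sketched as follows. The class name and bandwidth parameter are illustrative assumptions, and the local model is reduced to a weighted mean (the constant prediction of a stump) to keep the example short.

```java
// Sketch of locally weighted regression with a Gaussian kernel: each neighbor
// is weighted by exp(-d^2 / (2*bandwidth^2)), so weight decays smoothly with
// distance from the query. The local model here is the weighted mean of the
// neighbor targets, mirroring the use of a simple base learner.
public class Lwr {
    public static double gaussianWeight(double distance, double bandwidth) {
        return Math.exp(-(distance * distance) / (2 * bandwidth * bandwidth));
    }

    public static double predict(double[] xs, double[] ys, double query, double bandwidth) {
        double num = 0, den = 0;
        for (int i = 0; i < xs.length; i++) {
            double w = gaussianWeight(Math.abs(xs[i] - query), bandwidth);
            num += w * ys[i];
            den += w;
        }
        return num / den;
    }
}
```

The bandwidth plays the role that the choice of k plays above: a small bandwidth makes the fit very local, while a large one approaches a global average.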
4.2 EAGER METHODS
Eager method implementations required for this project were available from the WEKA
package: Support Vector Machines, Artificial Neural Network, RBF network, linear regression,
model trees and regression trees. Figure 4.1 provides the graphical interface of WEKA
Explorer’s classification panel.
Figure 4.1: WEKA Explorer’s GUI Classification Panel.
WEKA’s SVM implementation used the Sequential Minimal Optimization (SMO)
algorithm developed by Smola and Scholkopf (1998) for training a support vector regression
model. Although not required for our dataset, it can globally replace all missing values and
transform nominal attributes into binary ones. All attributes were normalized, and the complexity
factor of SVR was varied from 0.1 to 1000 in uneven steps. It is necessary to carefully set the
value for epsilon (the amount to which deviations are tolerated when fitting the curve, a value in
the SVM equation) because it somewhat controls the overfitting extent. When using the
polynomial kernels, we did not feel it was necessary to transform the input space by an exponent
higher than 3, as cubic transformations will suffice for most complex surfaces in the feature
space (Isozaki 2002). Additionally, RBF kernels were also used, and the gamma for the kernel
was adjusted accordingly.
ANNs and RBF networks are types of feedforward networks: they do not contain any
cycles, and the network's output depends only on the current input instance. The RBF network
implementation from WEKA used normalized Gaussian kernels with the k-means clustering
algorithm to provide the basis functions. It standardizes all numeric attributes to zero mean and
unit variance, then applies the clustering algorithm to generate clusters, and finally uses
symmetric multivariate Gaussians to fit the data from each cluster. The number of clusters (for
k-means to generate) and the ridge value of the linear regression were adjusted empirically
through cross validation. The ANN implementation in WEKA provides a GUI panel (shown in
Figure 4.2) in which the nodes, structure, and connections of the neural net can be customized.
However, a major drawback of WEKA's neural network is the inability to save and load a
network structure once created. Because ANNs were not a significant focus of this research, we
did not use structural variations of networks, e.g. Ward networks or Jordan-Elman networks.
The WEKA implementation uses a sigmoid activation function, can decay the learning rate per
epoch, can use a seed to randomly initialize the weights, and has other standard options.
However, it has no heuristics for automatically suggesting the number of hidden nodes.
Figure 4.2: Sample structure of an ANN used for prediction with
a single hidden layer consisting of 20 neurons. (WEKA)
For Model Trees, a modified implementation of Quinlan's (1992) M5 algorithm was
used, which is capable of post-pruning to reduce overfitting and of smoothing to compensate for
the sharp discontinuities that are likely to occur between adjacent linear models at the leaves of
the pruned tree (Witten & Frank 2005). To build a regression tree in WEKA, a decision tree
induction algorithm builds an initial tree greedily using information gain/variance. Afterwards it
uses reduced-error pruning with backfitting to generate regression equations along the path back
up. It produces stable models: given the same training set, it will output the same model. It can
handle noise, missing attribute values, and irrelevant attributes.
4.3 HYBRID METHODS
The four-dimensional instance space represented by time, moisture, temperature, and
oxygen uptake rate appears to have seemingly random patterns in areas that could not be
captured by experimentation. In order to interpolate those complex areas of the hyper-curves, we
developed an algorithm that uses powerful eager learners like SVM and Model Trees to build a
model at query time from several selected nearest neighbors. Writing our own hybrid method,
instead of using the Knowledge Flow feature of WEKA to graphically design customized hybrid
schemes, gave us more control. It was written in the Java 5.0 programming language using the
powerful API features provided by WEKA, keeping integration with the GUI in mind. The
lazy-eager methods used k-NN with Euclidean distance as the lazy component and one of three
eager methods: parameter-tuned SVM, Model Trees, or the RBF Network. We did not use a
weighting kernel function for the k neighbors found, because we felt a powerful eager learner
would be able to implicitly weight each instance during training.
4.4 HYBRID NEURO-FUZZY SYSTEM
We continued our hybrid learning scheme experimentation with Matlab’s implementation
of the neuro-fuzzy scheme developed by Jang (1993) in a package called ANFIS (Adaptive
Neuro Fuzzy Inferencing System). ANFIS provides a platform based on Matlab to
programmatically or graphically use the learning scheme. Several utility functions were written
in Matlab’s programming language for partitioning and manipulating the data for testing and
training. An example structure of the adaptive network is shown in Figure 4.4.
Figure 4.3: Screenshot of Lazy-Eager hybrid algorithm running in batch mode.
Figure 4.4: The structure of an adaptive neuro-fuzzy system created by ANFIS.
CHAPTER 5
EVALUATION, RESULTS AND ANALYSIS
In this thesis, we have employed various types of ML methods to understand the
applicability of ML schemes to this domain and to extract an accurate model. If one model
performs better than another, can we state that it is a better model? A difference in model
accuracy based on one dataset may be due to estimation errors and should not be the sole basis
for conclusions about comparative model performance. Alternatively, how do we know whether
one ML scheme is actually better than another for this domain? The difference in method
performance may be a chance effect (Witten & Frank 2005) in the estimation process. This
chapter attempts to answer these questions based on experimental results and statistical tests. It
presents quantitative measurements of performance, followed by graphical presentations of how
well each model approximated the evaluation dataset. Finally, statistical significance tests are
performed to compare the various learning methods at confidence levels of at least 95%.
Generalizing is the primary goal of Machine Learning algorithms, so the more data
available, the better are the chances of learning the correct pattern; however, limited data poses a
great challenge to the learning process. Estimating the accuracy of a model or ML scheme on the
training data is quite simple, but due to bias and variance in the estimate, we are unsure of the
accuracy of the model and method on future unseen data with unknown distribution (Mitchell
1997). Therefore, when only limited data is available, numerous metrics for comparing the
accuracy of two models and statistical tests for comparing the accuracy of two ML methods were
considered. Although there is ongoing debate regarding the best way to learn and compare
hypotheses from limited data, it is generally accepted that the metrics presented in this chapter
are the best ways (Witten & Frank 2005; Mitchell 1997) to evaluate models and ML methods for
continuous prediction.
5.1 MODEL EVALUATION METRICS
In order to evaluate the performance of the various models obtained from a learning
scheme, we used an independent test set that was not only withheld from training but is also the
product of a separate composting experiment performed at a different temperature setting (1460
instances at 34°C). We felt this approach offered the best objectivity given the limited data and
would simulate a real-world situation. The ML methods were trained on the 8760 patterns of
training samples and subsequently tested on the independent 1460-instance test bed.
Several measures of the error estimates are used to remain objective in our evaluations,
namely Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and the Correlation
Coefficient (CC or r2), which are described below. The following notation is used in the
descriptions: $p_i$ is the predicted value of the $i$th instance, $a_i$ is the actual (observed)
value of the $i$th instance, $n$ is the number of instances used in testing, and the mean of the
observed values is $\bar{a} = \frac{1}{n}\sum_{j=1}^{n} a_j$.
• RMSE is the most commonly used measure:
$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(p_i - a_i)^2}$ ..................(5.1)
• MAE is another common measure:
$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert p_i - a_i \rvert$ ..................(5.2)
• The Correlation Coefficient (r2) measures the statistical correlation between the actual
and predicted target values. The underlying correlation coefficient takes values between
−1 and 1, where 1 indicates perfect positive correlation and 0 indicates no correlation.
This measure is scale independent: the magnitude of the difference between pi and ai
does not matter as long as they move in the same direction. A high r2 indicates good
correlation between ai and pi, implying good performance.
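The three metrics can be computed directly from their definitions; the sketch below does exactly that (the class name is an illustrative assumption).

```java
// The evaluation metrics used in this chapter, computed from their
// definitions: mean absolute error, root mean square error, and the Pearson
// correlation coefficient between predicted and actual values.
public class Metrics {
    public static double mae(double[] p, double[] a) {
        double s = 0;
        for (int i = 0; i < p.length; i++) s += Math.abs(p[i] - a[i]);
        return s / p.length;
    }

    public static double rmse(double[] p, double[] a) {
        double s = 0;
        for (int i = 0; i < p.length; i++) s += (p[i] - a[i]) * (p[i] - a[i]);
        return Math.sqrt(s / p.length);
    }

    public static double correlation(double[] p, double[] a) {
        int n = p.length;
        double mp = 0, ma = 0;
        for (int i = 0; i < n; i++) { mp += p[i]; ma += a[i]; }
        mp /= n; ma /= n;
        double spa = 0, spp = 0, saa = 0;
        for (int i = 0; i < n; i++) {
            spa += (p[i] - mp) * (a[i] - ma); // covariance term
            spp += (p[i] - mp) * (p[i] - mp);
            saa += (a[i] - ma) * (a[i] - ma);
        }
        return spa / Math.sqrt(spp * saa);
    }
}
```

Note that the correlation is scale independent, as stated above: multiplying all predictions by a constant leaves it unchanged, while MAE and RMSE would grow.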
5.2 MODEL DEVELOPMENT & EVALUATION
The three input variables (descriptors) used for all model development were temperature,
moisture content, and time during composting. In order to objectively and fairly compare our
results with those that have been published, we used a model development methodology similar
to Liang et al. (2003a). Model development for each method had two phases: 1) selecting the
preferred parameter settings for the method by minimizing the errors (RMSE, MAE) and
maximizing r2, and 2) evaluating the model created with the preferred parameters against the
separate evaluation data set (which was not used during model development). The model
development dataset consisted of 8760 patterns covering six temperatures and five moisture
contents, and was used solely to determine the preferred parameter settings for each method.
Once the preferred parameter settings were selected, the final model was built and then evaluated
on the evaluation data set of 1460 patterns. The evaluation data set contained patterns at 34°C (a
temperature setting not present in the development data set) with five moisture contents (30%,
40%, 50%, 60%, and 70%). For a more complete description, please refer to the source of the
data set, Liang et al. (2003a,b).
Different methods need different forms of data partitioning; e.g., ANN and ANFIS use
three partitions while SVM, k-NN, RBF, etc. use two. The ANN used the following: 1) a
training set to adjust the weights (i.e. learn) during the training process, 2) a validation set to
evaluate the accuracy of ANN models during training in order to determine when to stop
training and avoid overfitting, and 3) a testing set to evaluate the resulting model created by the
ANN. Other methods like SVM and RBF do not make use of a validation set; they simply use
training and testing sets as previously described. Although the literature may refer to these sets
by different names, they are essentially the same. The validation set and testing set as defined in
this thesis are synonymous with the testing set and production set, respectively, as used by Liang
et al. (2003a). For all methods, one replicate of 43°C (730 instances) was set aside as the testing
set, while the remaining data (8030 instances) was allocated for training. When required, the
training set partitioning strategy used to create the validation set was chosen based on how the
model created by a method performed on the remaining testing set (730 instances); the effects of
choosing different training set partitions for post-pruning in the Model Tree algorithm are shown
in Table 5.1 (the 87% partitioning strategy gave the lowest errors and was thus preferred). Next,
the preferred parameters for each ML method were selected based on performance on the chosen
testing set. An example of how the preferred parameters were chosen for ANNs during the
model development phase is shown in Tables 5.2 and 5.3, which tabulate the effects of using
different learning rates and numbers of hidden nodes. Table 5.4 shows the effect of different
k-values (number of nearest neighbors) for models created by two hybrid lazy-eager learners
based on the training-testing split mentioned before. The shaded green cell indicates the
parameter value chosen for final model training.
Table 5.1: Effect of different training set partitioning strategies for Model Trees on the testing set. The 87% partitioning, which gave the lowest errors, was chosen.
Data partitioning (% in training set) | CC | MAE | RMSE
50% | 0.9432 | 0.0831 | 0.1303
66% | 0.9441 | 0.0815 | 0.1294
70% | 0.9473 | 0.0800 | 0.1263
75% | 0.9462 | 0.0811 | 0.1287
80% | 0.9488 | 0.0794 | 0.1268
82% | 0.9472 | 0.0805 | 0.1286
85% | 0.9495 | 0.0770 | 0.1247
87% | 0.9507 | 0.0767 | 0.1233
90% | 0.9478 | 0.0778 | 0.1263
Table 5.2: Effect of different learning rates (with decay) on model errors for ANN on the testing set during the model development phase. The value of 0.5 was chosen since it gave lowest errors.
learning rate | momentum | training set size (% of all data) | validation set size (% of training set) | nodes in hidden layer | CC | MAE | RMSE
0.2 | 0.1 | 90% | 20% | 10 | 0.8957 | 0.1577 | 0.1985
0.3 | 0.1 | 90% | 20% | 10 | 0.8570 | 0.1489 | 0.2029
0.4 | 0.1 | 90% | 20% | 10 | 0.8539 | 0.1501 | 0.2051
0.5 | 0.1 | 90% | 20% | 10 | 0.8961 | 0.1216 | 0.1750
0.6 | 0.1 | 90% | 20% | 10 | 0.8952 | 0.1295 | 0.1837
0.8 | 0.1 | 90% | 20% | 10 | 0.8866 | 0.1342 | 0.1837
Table 5.3: Effect of hidden node numbers on ANN model errors (using the testing set) during the model development phase. The value of 22 was chosen.
nodes in hidden layer | learning rate | momentum | CC | MAE | RMSE
15 | 0.5 | 0.1 | 0.8824 | 0.1302 | 0.1854
18 | 0.5 | 0.1 | 0.8895 | 0.1276 | 0.1801
20 | 0.5 | 0.1 | 0.8966 | 0.1242 | 0.1812
22 | 0.5 | 0.1 | 0.8971 | 0.1211 | 0.1731
25 | 0.5 | 0.1 | 0.8899 | 0.1297 | 0.1799
30 | 0.5 | 0.1 | 0.8873 | 0.1281 | 0.1818
Table 5.4: Effect of different k-values on MAE for lazy-eager hybrid learners on testing set during the model development phase. Values of 15 and 20 were chosen for the two methods.
k-value Lazy model tree Lazy SVM
5 0.07822373 0.078247598
10 0.07768349 0.077152532
15 0.07538994 0.074708158
20 0.07689541 0.071553716
25 0.07684661 0.074557203
30 0.07719097 0.074925701
50 0.07885324 0.083115533
To aid in finding the preferred parameter settings during model development, programs
were written in Java (Java 1.5 2005) (Figure 4.3) using the WEKA API to automate parts of this
lengthy process. When applicable, literature recommendations were also used for parameter
tuning. The results of various runs are tabulated in Tables 5.1 through 5.4. When required by an
algorithm, preprocessing was performed; for example, normalizing for SVM and standardizing
for RBF.
Table 5.5 shows the performance of the final models created by each method on the
evaluation dataset. The results indicate that the models created by the hybrid lazy-eager learners
performed best, followed closely by models created by lazy, tree-based, and rule-based ML
schemes. Surprisingly, kernel-based methods like SVM, ANN, and the RBF network, which
usually exhibit superior results in other domains, did not produce the best models.
Table 5.5: Summary of preferred models (after finding suitable parameters during model development phase) applied to evaluation data set (1460 instances). *Result from Liang et al.
(2003a), **Result from Morris (2005), - indicates not available, yellow shade indicates tree based method, green shade indicates rule-based method, bold red indicates best results.
Method | MAE | RMSE | r2
Eager Learners
SVM regression | 0.1523 | 0.1864 | 0.8847
linear regression | 0.2639 | 0.3687 | 0.7100
RBF Network | 0.1230 | 0.1636 | 0.9329
ANN (1 hidden layer) | 0.1368 | 0.1769 | 0.9226
M5Rules | 0.0756 | 0.1180 | 0.9624
Model Tree | 0.0740 | 0.1190 | 0.9657
Regression Tree | 0.0856 | 0.1448 | 0.9114
ANN (Ward network with 3 slabs in hidden layer)* | 0.110* | - | 0.9280
Lazy Learners
k-NN (k=20) | 0.0770 | 0.1282 | 0.9615
LW linear regression (k=30) | 0.0764 | 0.1246 | 0.9624
Ensemble - Multiple Models
Model Tree w/bagging | 0.0751 | 0.1211 | 0.9583
Regression Tree w/bagging | 0.0798 | 0.1315 | 0.9474
k-NN w/bagging | 0.0772 | 0.1289 | 0.9361
Hybrids
lazy SVM (k=20) | 0.0709 | 0.1187 | 0.9661
lazy Model tree (k=15) | 0.0755 | 0.1252 | 0.9616
lazy M5Rules (k=20) | 0.0753 | 0.1244 | 0.9624
ANFIS | 0.0771 | 0.1161 | -
FISSION** | 0.1095 | 0.1411 | 0.9028
5.3 EAGER LEARNING METHOD RESULTS
The performances of models created by eager methods against the model evaluation set
of 1460 data patterns are graphically presented in this section. For each moisture setting, the
performance of the model was compared to the two sets of observed values, which is plotted in
the following graphs. The blue and green lines indicate observed values (two replicates) during
the experiment, while the red line shows the values predicted by the model. Good performance is
indicated by how well it is able to fit the observed plots.
The performance of the ANN model is shown in Figures 5.1 to 5.4. As discussed in
Section 3.10, ANNs use the gradient descent algorithm in Backpropagation to learn the target
function by searching for the best weight vector; the possible weight values represent the space
of learnable functions. Given two positive examples with no negative examples between them,
Backpropagation will tend to label the points in between as positive. This smooth interpolation
between data points is the inductive bias of the Backpropagation learning algorithm (Mitchell
1997). From this, we can see that for a highly irregular, complex instance space, an ANN will
tend to smooth the unknown parts of the hypercurve. Recall that we hypothesized an issue with
the training data: it contained a few distinct attribute values spaced at regular intervals. Due to
the unavailability of points between the discrete intervals, the ANN would tend to smooth out
irregularities of the hyperspace that might have been an important part of the model. This effect
is exhibited as excessive smoothing in the graph of Figure 5.4.
[Chart omitted: "ANN modeling performance - 40%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.1: Comparison between experimentally observed values (replicate 1 & 2) and
ANN predictions. (Patterns at 34°C and 40% moisture content).
[Chart omitted: "ANN modeling performance - 50%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.2: Comparison between experimentally observed values (replicate 1 & 2) and
ANN predictions. (Patterns at 34°C and 50% moisture content).
[Chart omitted: "ANN modeling performance - 60%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.3: Comparison between experimentally observed values (replicate 1 & 2) and
ANN predictions. (Patterns at 34°C and 60% moisture content).
[Chart omitted: "ANN modeling performance - 70%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.4: Comparison between experimentally observed values (replicate 1 & 2) and
ANN predictions. (Patterns at 34°C and 70% moisture content).
The performance of the RBF network models is plotted in the graphs in Figures 5.5 to
5.8. The models in this method are built eagerly from local approximations centered around the
training examples, or around clusters of training examples in which the RBF centers are placed.
A high number of kernel centers was used to follow the contour of the hypercurve; however,
beyond a certain number of kernels, the model started overfitting, which resulted in significantly
lower testing performance. Too many Gaussian kernels with small standard deviations fit the
samples well but also render the jagged appearance shown in the graphs of the RBF network
model performance. This adverse side-effect of fitting the testing set too closely possibly
accounted for the average performance on the evaluation set.
[Chart omitted: "RBF Network modeling performance - 40%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.5: Comparison between experimentally observed values (replicate 1 & 2) and
RBF Network predictions. (Patterns at 34°C and 40% moisture content).
[Chart omitted: "RBF Network modeling performance - 50%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.6: Comparison between experimentally observed values (replicate 1 & 2) and
RBF Network predictions. (Patterns at 34°C and 50% moisture content).
[Chart omitted: "RBF Network modeling performance - 60%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.7: Comparison between experimentally observed values (replicate 1 & 2) and
RBF Network predictions. (Patterns at 34°C and 60% moisture content).
[Chart omitted: "RBF Network modeling performance - 70%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.8: Comparison between experimentally observed values (replicate 1 & 2) and
RBF Network predictions. (Patterns at 34°C and 70% moisture content).
It was expected that SVM would perform well because it is a widely acclaimed and
powerful eager method for classification and function approximation; however, this was not the
case. The model fits shown in Figures 5.9 to 5.12 are poor and too smooth, unable to capture the
irregularities and complexities of the target class, leading to poor performance. The smoothness
is possibly caused by SVM's mechanism of using maximum-margin hyperplanes based on
support vectors to find classification boundaries (Witten & Frank 2005). Once a non-linear
mapping transforms the input space, the algorithm identifies important points called support
vectors (SVs) in the transformed feature space that define the classification boundaries
(important points on the convex hull). The SVs are then used to find a regression function that
can tolerate deviations up to a certain error value, ε. This essentially forms a tube (Figure 5.13)
around the target curve, creating a smoothing effect on the hypercurve, which is exhibited in the
SVM graphs. In later experiments (please refer to Section 5.7), we attempted to reduce this
oversmoothing effect by minimizing the width of the ε-tube, but this resulted in all training
instances becoming support vectors (10,220 and 8,760) and minimal predictive improvement. In
the worst case, if no error is tolerated, the algorithm simply performs least-absolute-error
regression (Witten & Frank 2005). Vapnik et al. (1997) state that even for moderate
approximation quality, the number of SVs can be considerably high.
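The ε-insensitive loss that produces this tube can be sketched directly; the class name is an illustrative assumption.

```java
// Sketch of the ε-insensitive loss behind the SVR "tube": deviations smaller
// than ε cost nothing (they fall inside the tube), while larger deviations
// are charged linearly for the amount by which they exceed ε. Shrinking ε
// tightens the tube, turning more training points into support vectors.
public class EpsilonLoss {
    public static double loss(double predicted, double actual, double epsilon) {
        double deviation = Math.abs(predicted - actual);
        return Math.max(0, deviation - epsilon); // zero inside the tube
    }
}
```

Because every point inside the tube contributes zero loss, the fitted curve is free to ignore small fluctuations, which is precisely the smoothing effect visible in the SVM graphs.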
[Chart omitted: "SVM modeling performance - 40%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.9: Comparison between experimentally observed values (replicate 1 & 2) and SVM regression predictions. (Patterns at 34°C and 40% moisture content).
[Chart omitted: "SVM modeling performance - 50%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.10: Comparison between experimentally observed values (replicate 1 & 2) and SVM regression predictions. (Patterns at 34°C and 50% moisture content).
[Chart omitted: "SVM modeling performance - 60%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.11: Comparison between experimentally observed values (replicate 1 & 2) and SVM regression predictions. (Patterns at 34°C and 60% moisture content).
[Chart omitted: "SVM modeling performance - 70%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.12: Comparison between experimentally observed values (replicate 1 & 2)
and SVM regression predictions. (Patterns at 34°C and 70% moisture content).
Figure 5.13: The ε-tube in Support Vector Regression. ε is the precision (error margin) to which the approximated curve should fit the instance points. ξ is a slack variable that allows the curve to deal with erroneous instance points. (Smola & Scholkopf 2003)
The graphs in Figures 5.14 through 5.17 show the performance of the models created by
Model Trees on the evaluation data set. According to the error metrics in Table 5.5 and these
figures, the Model Tree was one of the best performing methods for this domain.
[Chart omitted: "Model Tree modeling performance - 40%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.14: Comparison between experimentally observed values (replicate 1 & 2)
and Model tree predictions. (Patterns at 34°C and 40% moisture content).
[Chart omitted: "Model Tree modeling performance - 50%M". X-axis: Time (hrs), 0-250; Y-axis: Oxygen Uptake Rate, 0.00-2.00; series: Observed1, Observed2, Predicted.]
Figure 5.15: Comparison between experimentally observed values (replicate 1 & 2)
and Model tree predictions. (Patterns at 34°C and 50% moisture content).
[Chart: Model Tree modeling performance - 60%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.16: Comparison between experimentally observed values (replicate 1 & 2)
and Model tree predictions. (Patterns at 34°C and 60% moisture content).
[Chart: Model Tree modeling performance - 70%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.17: Comparison between experimentally observed values (replicate 1 & 2)
and Model tree predictions. (Patterns at 34°C and 70% moisture content).
The model created by the M5Rules algorithm was able to handle this domain successfully, as shown by the error metrics in Table 5.5 and the graphs in Figures 5.19 through 5.22. The M5Rules algorithm infers rules using a divide-and-conquer approach by iteratively creating model trees. In each iteration, a model tree is built using the M5 model tree building algorithm, and the best leaf is selected as a rule. The M5Rules graphs illustrate the effect of rules for numeric prediction. Unlike other methods, M5Rules does not produce smooth curves to approximate a function, but rather a series of connected straight lines, which result from the rule firing process. The advantage of rule-inferring algorithms is that they generate human-readable rules. Typically, the number of rules generated for this domain ranged from 20 to 200, depending on the parameter settings. Figure 5.18 shows two rules generated by the M5Rules algorithm. The rules generated by the M5Rules algorithm placed moisture at the parent node and also as a parent of several other subtrees, indicating that moisture has the highest information gain and is possibly the most significant attribute.
Rule 14: IF moisture <= 45 AND time > 67.5 AND moisture <= 35 AND temp > 46.5
         THEN o2rate = 0.0007 * moisture + 0 * time - 0.0004 * temp + 0.0151   [427/2.687%]

Rule 15: IF moisture <= 45 AND moisture <= 35 AND temp <= 39.5 AND time <= 47.5
         THEN o2rate = 0.0002 * moisture + 0 * time + 0 * temp + 0.0023   [1142/2.46%]
Figure 5.18: Sample rules extracted from the M5Rules algorithm, which is based on model trees. Among the many rules generated, moisture and time predominantly appear at parent levels, indicating their importance over the other factors.
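The firing of such rules for numeric prediction can be sketched in code. The sketch below transcribes the two sample rules from Figure 5.18 directly; the first-match (fall-through) evaluation order and the function names are illustrative assumptions, not the thesis's implementation.

```python
def rule_14(moisture, time, temp):
    """Rule 14 from Figure 5.18: fires for dry, hot, late-stage patterns."""
    if moisture <= 45 and time > 67.5 and moisture <= 35 and temp > 46.5:
        return 0.0007 * moisture + 0 * time - 0.0004 * temp + 0.0151
    return None

def rule_15(moisture, time, temp):
    """Rule 15 from Figure 5.18: fires for dry, cool, early-stage patterns."""
    if moisture <= 45 and moisture <= 35 and temp <= 39.5 and time <= 47.5:
        return 0.0002 * moisture + 0 * time + 0 * temp + 0.0023
    return None

def predict_o2rate(moisture, time, temp, rules=(rule_14, rule_15)):
    """Evaluate rules in order; the first rule whose conditions hold fires,
    and its linear model at the leaf produces the numeric prediction."""
    for rule in rules:
        o2rate = rule(moisture, time, temp)
        if o2rate is not None:
            return o2rate
    return None  # no rule fired; a full rule list would cover all cases
```

Because each firing rule answers with its own linear model, a query that crosses from one rule's region to another produces the connected straight-line segments visible in the M5Rules graphs.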
Although the use of model and regression trees for solving complex real-valued regression problems has been declining, trees performed exceptionally well on this dataset, as shown by the M5Rules and model tree graphs. Intuitively, model trees should be limited in their expressiveness, since they merely use a set of linear equations to describe a complex hypercurve. However, by splitting the instance space and then applying regression equations to fit those subsets, trees are possibly able to capture the phases involved in composting using a linear yet highly relevant approximation (this is discussed further in Section 5.8 and illustrated in Figure 5.44).
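The splitting-plus-local-regression idea can be illustrated with a minimal sketch. This is not the M5 algorithm itself (which chooses splits by standard-deviation reduction and applies pruning and smoothing); it is a one-level, single-attribute toy version showing how piecewise linear models can capture a nonlinear curve.

```python
import numpy as np

def fit_stump_model_tree(x, y):
    """One-level model tree: try every split point on a single attribute,
    fit a least-squares line on each side, and keep the split that
    minimises the summed squared error of the two local models."""
    best = None
    for split in np.unique(x)[1:]:               # candidate thresholds
        left, right = x < split, x >= split
        if left.sum() < 2 or right.sum() < 2:
            continue
        models, sse = [], 0.0
        for mask in (left, right):
            coef = np.polyfit(x[mask], y[mask], 1)   # local linear model
            sse += np.sum((np.polyval(coef, x[mask]) - y[mask]) ** 2)
            models.append(coef)
        if best is None or sse < best[0]:
            best = (sse, split, models)
    _, split, (m_left, m_right) = best
    return lambda q: np.polyval(m_left if q < split else m_right, q)

# Example: a piecewise-linear target (rises, then plateaus) is recovered
x = np.arange(0.0, 10.0)
y = np.where(x < 5, 2 * x, 10.0)
predict = fit_stump_model_tree(x, y)
```

A single global line would fit this target poorly, but one split with a local line per side reproduces it, which is the same mechanism by which deeper model trees track the distinct phases in the composting curves.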
[Chart: M5Rules modeling performance - 40%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.19: Comparison between experimentally observed values (replicate 1 & 2)
and M5Rules predictions. (Patterns at 34°C and 40% moisture content).
[Chart: M5Rules modeling performance - 50%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.20: Comparison between experimentally observed values (replicate 1 & 2)
and M5Rules predictions. (Patterns at 34°C and 50% moisture content).
[Chart: M5Rules modeling performance - 60%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.21: Comparison between experimentally observed values (replicate 1 & 2) and M5Rules predictions. (Patterns at 34°C and 60% moisture content).
[Chart: M5Rules modeling performance - 70%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.22: Comparison between experimentally observed values (replicate 1 & 2) and M5Rules predictions. (Patterns at 34°C and 70% moisture content).
5.4 LAZY LEARNING METHOD RESULTS
Lazy learning methods produced remarkably accurate models that performed well on the independent test data. This is in contrast to many domains where powerful eager learners such as ANNs and SVMs perform remarkably well. k-NN achieved a best MAE of 0.0770 with a correlation coefficient of 96.15%; Locally Weighted Linear Regression (LWR) achieved a similar MAE of 0.0764 and a correlation coefficient of 96.24%. Both methods showed approximately 30% improvement over previously published models using the same evaluation set and training/testing partitions. The graphs in Figures 5.23 through 5.26 show the model performance of LWR on the evaluation data set; the graphs in Figures 5.27 through 5.30 show k-NN model performance.
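Locally weighted linear regression can be sketched as follows. This is a generic Gaussian-kernel version; the kernel, the bandwidth `tau`, and the Euclidean distance are illustrative assumptions, not the exact settings used in this thesis.

```python
import numpy as np

def lwr_predict(X, y, query, tau=1.0):
    """Locally weighted linear regression: weight each training instance
    by a Gaussian kernel of its distance to the query, then solve a
    weighted least-squares problem for this single query (lazy learning:
    no global model is ever built)."""
    d2 = np.sum((X - query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * tau ** 2))            # nearby points dominate
    A = np.hstack([X, np.ones((len(X), 1))])      # add intercept column
    sw = np.sqrt(w)
    theta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return np.append(query, 1.0) @ theta

# Toy usage: a noiseless line y = 2x + 1 is recovered at any query point
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X[:, 0] + 1.0
```

The weighted fit is re-solved per query, which is what makes the learner lazy: the hypothesis is local to the query rather than fixed at training time.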
[Chart: LW Linear Regression modeling performance - 40%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.23: Comparison between experimentally observed values (replicate 1 & 2) and Locally
Weighted Linear Regression predictions. (Patterns at 34°C and 40% moisture content).
[Chart: LW Linear Regression modeling performance - 50%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.24: Comparison between experimentally observed values (replicate 1 & 2) and Locally
Weighted Linear Regression predictions. (Patterns at 34°C and 50% moisture content).
[Chart: LW Linear Regression modeling performance - 60%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.25: Comparison between experimentally observed values (replicate 1 & 2) and Locally
Weighted Linear Regression predictions. (Patterns at 34°C and 60% moisture content).
[Chart: LW Linear Regression modeling performance - 70%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.26: Comparison between experimentally observed values (replicate 1 & 2) and Locally
Weighted Linear Regression predictions. (Patterns at 34°C and 70% moisture content).
[Chart: kNN modeling performance - 40%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.27: Comparison between experimentally observed values (replicate 1 & 2)
and k-NN predictions. (Patterns at 34°C and 40% moisture content).
[Chart: kNN modeling performance - 50%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.28: Comparison between experimentally observed values (replicate 1 & 2)
and k-NN predictions. (Patterns at 34°C and 50% moisture content).
[Chart: kNN modeling performance - 60%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.29: Comparison between experimentally observed values (replicate 1 & 2) and k-NN predictions. (Patterns at 34°C and 60% moisture content).
[Chart: kNN modeling performance - 70%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.30: Comparison between experimentally observed values (replicate 1 & 2) and k-NN predictions. (Patterns at 34°C and 70% moisture content).
5.5 COMBINING MODELS – ENSEMBLE RESULTS
Although expected, our experimentation confirms that eager learning schemes can benefit from bagging, boosting, and stacking, provided they are designed properly. Bagging produced some improvements for eager learners, as shown in Table 5.6, but failed to do so for lazy learners. Eager schemes form their hypothesis before query time, resulting in an inflexible model, which may be a disadvantage for certain domains. Bagging can provide more diversity by forming many hypotheses: 1) by training the algorithm on different versions of the training set, obtained by randomly sampling the entire data set, or 2) by relying on the inherent instability of some learners, e.g. an ANN or a randomly seeded RBF network.
When bagging is applied to lazy learners, it does not harness the power of using multiple models, because lazy learners only create their models at query time. At query time, a bagged lazy learner effectively produces a multiple of k neighbors, which amounts to the same lazy learning algorithm with a higher k-value. A higher k-value does not necessarily produce better prediction results, due to the possible inclusion of irrelevant or noisy instances (this effect is observed during parameter tuning and shown in Table 5.4). Therefore, it is not surprising that bagging did not produce any improvement for lazy learners.
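The bagging mechanics described above can be sketched generically. The `mean_learner` base learner below is a stand-in chosen only to keep the example self-contained; the thesis used WEKA's implementations of trees, k-NN, and other learners.

```python
import random

def bag(train, learn, n_models=10, seed=0):
    """Bagging for regression: train each model on a bootstrap replicate
    (sampled with replacement) of the training set, then average the
    predictions of all models at query time."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        replicate = [rng.choice(train) for _ in train]   # bootstrap sample
        models.append(learn(replicate))
    return lambda x: sum(m(x) for m in models) / len(models)

# Stand-in base learner: predicts the mean target of its training sample.
# An unstable learner (e.g. an unpruned tree) would benefit far more.
def mean_learner(data):
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y
```

For an eager base learner, each bootstrap replicate yields a genuinely different hypothesis; for a lazy learner such as k-NN, the averaged neighborhoods collapse into roughly the same prediction as a single run with a larger k, which matches the null result observed here.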
Because bagging stores the numerous models built in main memory, performance and hardware can become an issue, as encountered during experimentation. Model trees can grow to enormous sizes with large training sets, and with many unpruned trees used in bagging, we often ran out of main memory (1 GB).
Table 5.6: Summary of MAE values for several ML algorithms with and without bagging. (ibk # = instance-based learning, i.e. k-NN using # neighbors)

ML Scheme                               MAE (no bagging)   MAE (bagging)
ibk 1                                   0.0792             0.0798
ibk 5                                   0.0782             0.0786
ibk 10                                  0.0776             0.0776
ibk 15                                  0.0771             0.0772
ibk 20                                  0.0770             0.0772
ibk 25                                  0.0775             0.0774
ibk 30                                  0.0783             0.0787
ibk 100                                 0.0932             0.0942
linear regression function              0.2639             0.2638
regression tree - pruning, unsmoothed   0.0843             0.0794
regression tree - pruning, smoothed     0.0937             0.0905
model tree - pruning, unsmoothed        0.0768             0.0751
model tree - unpruned, unsmoothed       0.0796             out of memory
model tree - pruning, smoothed          0.0778             0.0783
model tree - unpruned, smoothed         0.0794             out of memory
Boosting methods like additive regression slightly improved the results of already well-performing learners, but were unable to help the others. Stacking should, in theory, perform no worse than its worst base classifier. This was confirmed: strong base classifiers gave good performance, and vice versa, as shown in Table 5.7. A major drawback of stacking is its complexity and the amount of time required for training.
Table 5.7: Sample results of stacking and boosting (additive regression).

Method                                                          CC       MAE      RMSE     time (s)
Stacking (Meta: Model Tree; Base: k-NN, Model Tree, RBF, LWR)   0.964    0.0751   0.1234   1634
Stacking (Meta: ANN; Base: RBF, ANN, SVM)                       0.9281   0.1221   0.1682   6295
Boosting: Additive Regression: Decision Stump X10               0.823    0.2179   0.2618   8
Boosting: Additive Regression: Model Trees X3                   0.9628   0.0777   0.1257   385
Boosting: Additive Regression: Model Trees X10                  0.9624   0.0825   0.1316   1245
Boosting: Additive Regression: SMOreg X2                        0.6511   0.2502   0.3406   1502
5.6 HYBRID METHOD RESULTS
Hybrid methods performed the best in this research, closely followed by rule-based and lazy learners. After unpromising results from SVM, ANN, and RBF, lazy learners were applied, yielding marked improvements. This inspired us to develop hybrid versions of some eager learners, since eager methods other than trees did not perform satisfactorily on the dataset. We therefore studied the effect of hybridizing an eager algorithm with a lazy learner. In the experiments described in Section 5.7, where we compared the performance of several algorithms and the effects of hybridization, we noted that the same eager learners (e.g. the kernel-based methods) were able to boost their performance substantially once hybridized, and to perform on par with the other algorithms. Figure 5.31 shows their comparative improvement in performance (please refer to Section 5.7 for details). Table 5.4 summarizes the effect of using different nearest-neighbor values on the hybrid algorithm during the model evaluation phase.
The graphs in Figures 5.32 through 5.35 present the performance of the model created by lazySVM on the evaluation dataset, which exhibits superior function approximation as well as low error values.
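The lazy-eager hybrid idea, training an eager learner at query time on only the k nearest instances, can be sketched as follows. A plain least-squares linear model stands in for the eager component (the thesis hybridized SVM, ANN, and trees), and Euclidean distance is an assumed metric; this is a reconstruction of the scheme, not the thesis's code.

```python
import numpy as np

def lazy_fit_predict(X, y, query, k=5):
    """Lazy-eager hybrid: at query time, select the k nearest training
    instances, train an eager learner on only that neighborhood, and
    answer the single query. Here the eager learner is a stand-in
    linear least-squares model with an intercept term."""
    d = np.sum((X - query) ** 2, axis=1)
    nearest = np.argsort(d)[:k]                   # indices of k neighbours
    Xn, yn = X[nearest], y[nearest]
    A = np.hstack([Xn, np.ones((k, 1))])          # local eager model
    theta, *_ = np.linalg.lstsq(A, yn, rcond=None)
    return np.append(query, 1.0) @ theta

# The global target is nonlinear (V-shaped), but each neighbourhood is
# nearly linear, so the local eager model approximates it well.
X = np.linspace(0.0, 10.0, 101).reshape(-1, 1)
y = np.abs(X[:, 0] - 5.0)
```

Because a fresh model is fit per query, the hybrid combines many local approximations into one complex overall hypothesis, which is the mechanism the statistical comparisons in Section 5.7 credit for the improvement.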
[Bar chart: Improvements in Hybridizing Eager Learners. Mean RMSE and mean MAE (error metric axis, 0.05-0.25) for Model Tree vs. Lazy Model Tree, SVM vs. Lazy SVM, ANN vs. Lazy ANN, and Regr. Tree vs. Lazy Regr. Tree.]
Figure 5.31: Effect of hybridizing an eager method is illustrated in terms of mean MAE and RMSE with standard deviations (over 30 bootstrapped runs, refer to Section 5.7 for details)
[Chart: lazySVM modeling performance - 40%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.32: Comparison between experimentally observed values (replicate 1 & 2)
and lazySVM predictions. (Patterns at 34°C and 40% moisture content).
[Chart: lazySVM modeling performance - 50%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.33: Comparison between experimentally observed values (replicate 1 & 2)
and lazySVM predictions. (Patterns at 34°C and 50% moisture content).
[Chart: lazySVM modeling performance - 60%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.34: Comparison between experimentally observed values (replicate 1 & 2)
and lazySVM predictions. (Patterns at 34°C and 60% moisture content).
[Chart: lazySVM modeling performance - 70%M. Oxygen Uptake Rate vs. Time (hrs); series: Observed 1, Observed 2, Predicted.]
Figure 5.35: Comparison between experimentally observed values (replicate 1 & 2)
and lazySVM predictions. (Patterns at 34°C and 70% moisture content).
5.7 EVALUATING MACHINE LEARNING SCHEMES
When two models need to be compared, it is usually sufficient to evaluate them on an unseen evaluation set or by performing k-fold cross-validation. However, comparing machine learning methods against each other is more complicated: for example, how would one justify the statement that algorithm A will perform better than algorithm B in this domain? The problem arises for several reasons: 1) there is a limited amount of evaluation data, 2) the underlying distribution of the unseen instances is unknown, and 3) there is uncertainty in the target and attribute values. An algorithm performing better on the evaluation set is not enough justification to claim superiority, because the difference in performance may be due to estimation errors (Mitchell 1997); e.g. the training or testing set may not be representative of the real world.
One solution to comparing algorithms with limited data is n-fold cross-validation; however, statisticians have developed the bootstrap, which many agree to be a better estimation method (Efron 1997). It is a strong statistical procedure for estimating the sampling distribution of the data (Mitchell 1997), which can then be used to obtain confidence levels for the difference in true prediction error of two methods. For the 0.632 bootstrap method, a new training set is created by sampling the original dataset (which contains n instances) n times with replacement. Because some elements in this second dataset will almost certainly be repeated, the remaining instances that were never picked are put in the test set (Witten & Frank 2005), which on average amounts to 36.8% of the data. Although the bootstrap is an estimation method, theoretical studies show that given enough samples it approaches the true sample means and the true underlying distribution of the domain (Witten & Frank 2005).
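The fold construction just described can be sketched as follows; the function names are illustrative, and the shared seed reflects the requirement (used below for the paired t-tests) that every scheme sees identical folds.

```python
import random

def bootstrap_fold(n, rng):
    """One 0.632-bootstrap fold: sample n instance indices with
    replacement for training; the instances never picked form the test
    set (on average about 36.8% of the data, since (1 - 1/n)^n -> 1/e)."""
    train = [rng.randrange(n) for _ in range(n)]
    test = sorted(set(range(n)) - set(train))
    return train, test

def bootstrap_folds(n, n_folds=30, seed=0):
    """The 30 folds shared by every ML scheme, so paired t-tests compare
    schemes on identical training/testing partitions."""
    rng = random.Random(seed)
    return [bootstrap_fold(n, rng) for _ in range(n_folds)]
```

Each fold's training set contains duplicates by design; its test set is exactly the out-of-bag instances, never seen during training on that fold.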
In this thesis, the full set of instances is bootstrapped 30 times, resulting in 30 folds of training and testing sets. Each machine learning scheme is run on each fold, for a total of 30 runs per scheme. To make a fair comparison and ensure more accurate paired t-tests, the same folds are used for every machine learning scheme. A total of 300 runs (30 folds x 10 schemes) were performed, and the results are shown in Table 5.8. The mean performance measured in MAE and RMSE over the 30 folds for each algorithm is plotted in Figure 5.36, and the standard deviations of the means are shown in Table 5.9. Figure 5.37 shows the average correlation coefficient of the schemes considered; in this case, performance is directly proportional to the correlation coefficient.
Table 5.8: Results gathered from 30 runs of the 0.632 bootstrap using all the instances. Different partitions were used for each of the 30 runs, but exactly the same bootstrapped partitions/folds were used for all schemes in order to make a fair comparison. (* best eager performer, ** best hybrid performer, ~ worst hybrid performer)

ML algorithm            mean CC   mean MAE   mean RMSE
RBF                     0.9236    0.1106     0.1531
SVM                     0.8896    0.1373     0.1809
ANN                     0.9101    0.1254     0.1700
regression tree         0.9512    0.0698     0.1235
model tree*             0.9576    0.0695     0.1150
k-NN                    0.9518    0.0688     0.1225
lazy SVM                0.9469    0.0638     0.1290
lazy ANN                0.9595    0.0617     0.1126
lazy model tree**       0.9629    0.0597     0.1077
lazy regression tree~   0.9523    0.0673     0.1122
Table 5.9: Mean MAE and RMSE along with the standard deviations of the 30 runs.

ML algorithm      mean MAE   σ MAE    mean RMSE   σ RMSE
Model Tree        0.0698     0.0016   0.1151      0.0027
Lazy Model Tree   0.0597     0.0012   0.1077      0.0022
SVM               0.1373     0.0018   0.2409      0.0027
Lazy SVM          0.0638     0.0015   0.1290      0.0035
ANN               0.1254     0.0077   0.1700      0.0092
Lazy ANN          0.0612     0.0012   0.1126      0.0026
RBF               0.1106     0.0018   0.1531      0.0026
Regr. Tree        0.0701     0.0019   0.1236      0.0037
Lazy Regr. Tree   0.0671     0.0015   0.1201      0.0030
k-NN              0.0682     0.0011   0.1226      0.0022
Although the graphs in Figures 5.36 and 5.37, along with Tables 5.8 and 5.9, provide a good indication of relative algorithm performance, they are not sufficient to conclude that one method is better than another. It is therefore necessary to establish statistical significance and confidence levels. The figures starting at 5.38 show the results of the statistical paired t-tests and present some interesting observations. The same 30 folds of data generated by bootstrapping were given to each ML scheme. A method is considered better than another if it achieved at least a 95% confidence level on the paired t-test using the 30 paired values.
[Bar chart: Performance Comparison of ML schemes. Mean MAE and mean RMSE (0.05-0.17) for RBF, SVM, ANN, regression tree, model tree, kNN, lazy SVM, lazy ANN, lazy model tree, and lazy regression tree.]
Figure 5.36: Performance comparison of different ML schemes on the composting dataset.
Values are average of 30 runs on the same 30 folds of bootstrapped data.
[Bar chart: Performance Comparison of ML Schemes. Correlation coefficient (0.88-0.97) for RBF, SVM, ANN, regression tree, model tree, kNN, lazy SVM, lazy ANN, lazy model tree, and lazy regression tree.]
Figure 5.37: Performance comparison based on Correlation Coefficient of different ML schemes.
Values are average of 30 runs on the same 30 folds of bootstrapped data.
[Bar charts: mean MAE and mean RMSE comparison of Model Trees vs. k-NN, with standard deviations.]
(a) Comparison of mean MAE values showing standard deviations.
(b) Comparison of mean RMSE values showing standard deviations.
                                    Model Trees   k-NN
mean MAE                            0.06966       0.06820
stdDev MAE                          0.00135       0.00115
t-Stat (MAE)                        6.83399
mean RMSE                           0.12299       0.12255
stdDev RMSE                         0.00291       0.00217
t-Stat (RMSE)                       1.07112
t critical value (95% confidence)   2.04523
Figure 5.38: Comparison of the best eager learner to the worst lazy learner: Model Trees vs. k-NN. Although k-NN has lower mean values for both MAE and RMSE, only one measure (MAE) provides a 95% confidence level that k-NN performs better than Model Trees. The difference based on RMSE is statistically insignificant, as t-Stat (RMSE) is lower than the critical t value (for 95% confidence). Therefore, it is statistically uncertain whether k-NN outperforms Model Trees.
[Bar charts: mean MAE and mean RMSE comparison of Model Tree vs. Lazy Model Tree, with standard deviations.]
(a) Comparison of mean MAE values showing standard deviations.
(b) Comparison of mean RMSE values showing standard deviations.
                                    Model Tree   Lazy Model Tree
mean MAE                            0.06984      0.05972
stdDev MAE                          0.00164      0.00119
t-Stat (MAE)                        23.72485
mean RMSE                           0.11513      0.10771
stdDev RMSE                         0.00267      0.00218
t-Stat (RMSE)                       10.70025
t critical value (95% confidence)   2.04523
Figure 5.39: Comparison of the hybrid version to the original: Model Trees vs. Lazy Model Trees. With a 95% confidence level we can say that Lazy Model Trees perform better than Model Trees on both error metrics; t-Stat > critical t value at the 95% confidence level.
[Bar charts: mean MAE and mean RMSE comparison of SVM vs. lazy SVM, with standard deviations.]
(a) Comparison of mean MAE values showing standard deviations.
(b) Comparison of mean RMSE values showing standard deviations.
                                    SVM         lazy SVM
mean MAE                            0.13731     0.06378
stdDev MAE                          0.00181     0.00146
t-Stat (MAE)                        164.57321
mean RMSE                           0.18093     0.12897
stdDev RMSE                         0.00273     0.00345
t-Stat (RMSE)                       62.07635
t critical value (95% confidence)   2.04523
Figure 5.40: Comparison of the hybrid version to the original: SVM vs. lazy SVM. With a 95% confidence level we can state that lazySVM outperforms SVM on both error metrics; t-Stat > critical t value at the 95% confidence level.
[Bar chart: mean MAE comparison of lazy SVM vs. lazy Model Tree, with standard deviations.]
(a) Comparison of mean MAE values showing standard deviations.
[Bar chart: mean RMSE comparison of lazy SVM vs. lazy Model Tree, with standard deviations.]
(b) Comparison of mean RMSE values showing standard deviations.
                                    lazy SVM   lazy Model Tree
mean MAE                            0.06378    0.05972
stdDev MAE                          0.00146    0.00119
t-Stat (MAE)                        29.12331
mean RMSE                           0.12897    0.10771
stdDev RMSE                         0.00345    0.00218
t-Stat (RMSE)                       50.18343
t critical value (95% confidence)   2.04523
Figure 5.41: Comparison of learning methods: lazy SVM vs. lazy Model Tree. With a 95% confidence level we can state that lazy Model Trees perform better than lazy SVM on both error metrics; t-Stat > critical t value at the 95% confidence level.
[Bar charts: mean MAE and mean RMSE comparison of ANN vs. Lazy ANN, with standard deviations.]
(a) Comparison of mean MAE values showing standard deviations.
(b) Comparison of mean RMSE values showing standard deviations.
                                    ANN        Lazy ANN
mean MAE                            0.12539    0.06166
stdDev MAE                          0.00774    0.0012
t-Stat (MAE)                        44.79261
mean RMSE                           0.16999    0.11257
stdDev RMSE                         0.00924    0.00257
t-Stat (RMSE)                       31.80514
t critical value (95% confidence)   2.04523
Figure 5.42: Comparison of the hybrid version to the original: ANN vs. Lazy ANN. With a 95% confidence level we can state that lazy ANN outperforms ANN on both error metrics; t-Stat > critical t value at the 95% confidence level.
5.8 ANALYSIS
It appears from our research that lazy learners, hybrid learners (i.e. lazy versions of eager learners), and trees and rule-based systems (e.g. neuro-fuzzy systems, the M5Rules algorithm) are highly successful in modeling this dataset. Statistical significance tests confirm the relative method performance for this domain. The remainder of this section explores the observations and provides possible explanations and further discussion.
The training patterns obtained from the experiments were carefully analyzed because of the unexpected behavior of the machine learning algorithms. As described in Chapter 2 of this thesis, the data showed that the composting behavior was not steady. Even though the experiments were conducted in the same facility under similar conditions, the microbial activity often changed erratically at certain temperatures and/or moisture contents. This behavior is exhibited in the graphs in Section 2.2.
Most microbial activity is pronounced near the high-moisture region, as illustrated in the 3D plot in Figure 5.43. Low moisture contents show minimal activity regardless of temperature, indicating that moisture plays the more important role in composting. This was also confirmed by the rule sets generated by the M5Rules algorithm (see Figure 5.18). Liang et al. (2003b) and Miller (1992) confirm this observation about the significance of moisture content as well. The peak of the graph indicates that, based on these two variables alone, a temperature of 30°-35°C and a moisture content of 50%-60% are most conducive to composting.
Figure 5.43: Reduced to 3 dimensions, this graph plots microbial activity (output) against moisture (input2) and temperature (input3). It should be noted that this graph changes over the time required for composting a substance.
Figure 5.44: (left) The three phases in composting: A - initial or mesophilic phase, B - thermophilic or high-rate phase, and C - curing or cooling phase, as shown in the literature (EPA 1995; Haug 1993; Das et al. 1998). (Left graph from EPA.) (right) Composting behavior produced by our model based on the training data.
Composting literature identifies three phases in composting, as shown in Figure 5.44 (left); each phase is characterized by the prevalence of certain microbes and by chemical, physical, and biological factors such as temperature, pH, and particle density/size. Some researchers go further and insert other phases between the aforementioned ones; one such phase is the lag phase, where microbial activity recedes temporarily until the optimal balance for it to thrive is restored. In Figure 5.44 (right), we can clearly see how our models roughly follow the experimental patterns, and also suggest the possibility of more phases. It is clear that composting obeys certain patterns, and the overall progress is governed by phases. Trees and rule-based systems shatter the instance space into many parts based on the tests at the nodes or the rules, and a function is then used to fit each shattered part of the instance space. Similar principles are employed by the hybrid learners: by only considering the k nearest instances, they implicitly shatter the instance space. Subsequently, function approximation methods like ANN, SVM, or regression are used to fit each shattered region. Essentially, the hybrid learners and the trees (rule-based systems) use a collection of locally approximated functions (one implicitly, the other by rules) to enhance their expressive power and approximate the overall complex target function. This analysis helps explain why trees, rule-based systems, and lazy learners performed very well in this domain. Our modeling work has revealed that many phases are involved in composting, and with the help of appropriate ML methods (data mining) we may identify more phases, patterns, and rules.
CHAPTER 6
CONCLUSION AND FUTURE WORK
In this research we have successfully used ML techniques and hybrid algorithms to obtain highly accurate prediction results. From this study, we have also gained a better understanding of the applicability of machine learning schemes to this domain. The effects of hybridizing eager learners with a lazy approach were studied using the 0.632 bootstrapping method. We have shown, with 95% confidence levels or higher, that the hybrid versions of eager algorithms performed better than the originals. Furthermore, the worst hybrid method was able to outperform the best eager method with a 95% confidence level (refer to Figure 6.1). Finally, the best model developed using the hybrid scheme achieved MAE = 0.0709, RMSE = 0.1187, and a correlation coefficient of 0.9661 on the evaluation dataset, which is more than a 35% improvement in MAE over published results. We conclude that hybridizing an eager learner using a lazy approach can significantly boost performance in this domain.
This research further indicates that the following types of algorithms can perform well for this domain: 1) rule-generating algorithms, e.g. neuro-fuzzy systems (ANFIS) and the M5Rules algorithm; 2) algorithms that can learn complex functions and change parts of their model on the fly, e.g. hybrid eager-lazy learners and lazy methods; and 3) tree-based methods, e.g. regression and model trees. Rules inferred from the M5Rules algorithm confirmed previous research findings about the importance of moisture (Liang et al. 2003b; Miller 1992).
[Bar chart: mean MAE comparison of Model Tree vs. lazyRegTree, with standard deviations.]
Figure 6.1: Performance comparison of the best eager method with the worst hybrid method: Model Tree vs. lazy Regression Tree. With a 95% confidence level, the errors of the lazy regression tree were lower than those of the model tree; t-Stat (9.68) > critical t value at the 95% confidence level (2.04523).
ML methods are highly applicable to real-world composting facilities, where 1) the composting substance does not have to be analyzed beforehand, 2) data from pilot-scale experiments can be used, 3) physical, chemical, and empirical rules from research can be assimilated, and 4) minimal understanding of the composting process is required. Composting runs can be used to obtain data, which can then be used to train machine learning algorithms to yield an accurate model. The resulting model can subsequently be used in a support system to control the composting process and boost its efficiency.
While this work has met its goal, we have identified several areas that merit attention for
further research. Because composting has several phases, a meta-learner could be used to
identify the phases and then apply suitable base-learners to each phase. Further research could be
conducted to identify the most suitable algorithm for each phase of the composting process. This
phase-based stacking approach may lead to more dynamic, robust and accurate models.
It would be of interest to create a model tree that generates the membership functions for a Fuzzy Inference System (FIS) rather than regression leaves. These generated membership functions may be able to capture the rules of this domain automatically, without domain experts. Another ML scheme that may be appropriate for this domain is a hybrid learning classifier, where the M5Rules algorithm is used to generate rules and a Genetic Algorithm is then used to find the best population of rules to classify a given query instance.
REFERENCES
Aha, D.W., Kibler, D., Albert, M.K. (1991). Instance-based learning algorithms. Machine Learning,
Vol.6, pp. 37-66.
Atkeson, C., A. Moore, and S. Schaal (1997). Locally weighted learning. AI Review, Vol.11, 11-73.
Baker, J.R. et al. (2004). Evaluation of Artificial Intelligence based models for chemical biodegradability
prediction. Molecules, Vol.9, 989-1004.
Bentley, J. L. (1975). Multidimensional binary search trees used for associative searching. Communications of the ACM, Vol.18, No.9, 509-517.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J. (1984). Classification and Regression Trees.
Chapman & Hall, New York.
Brodley, C.E., Fern, X. Z. (2003). Boosted lazy decision trees. The Twentieth International Conference on
Machine Learning.
Broomhead, D.S., Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, Vol.2, 321-355.
Bush, G.W. (2006). State of the Union Address by the President. Available from
http://www.whitehouse.gov/stateoftheunion/2006/index.html . Internet; accessed June 2006.
Das, K.C., Tollner, E.W. (1998). Development and preliminary validation of a compost process
simulation model. Proceedings Composting in the Southeast Conference and Expo. Sept. 9-11, 1998.
Efron, B. and Tibshirani, R.J. (1997), Improvements on cross-validation: The .632+ bootstrap method. J.
of the American Statistical Association, 92, 548-560.
Engelbrecht, A.P. (2002). An Introduction to Computational Intelligence. John Wiley & Sons Inc.
EPA (1995). Decision-Makers’ Guide to Solid Waste Management. EPA Publication Vol.2, EPA530-R-
95-023, 7-12.
EPA (1999). Biosolids Generations, use and disposal in the United States. EPA Report 1999, EPA530-R-
99-009, p.12-15, 27-35.
EPA (2003). Municipal Solid Waste in The United States: 2001 Facts and Figures. EPA Report 2003,
EPA530-R-03-011.
EPA (2006a). Environmental Protection Agency. Composting – Basic Information. Available from
http://www.epa.gov/epaoswer/non-hw/composting/basic.htm. Internet; accessed May 2006.
EPA (2006b). Municipal Solid Waste (MSW) – Basic Information. Available from
http://www.epa.gov/epaoswer/non-hw/muncpl/facts.htm. Internet; accessed May 2006.
Frank, E., Witten, I.H. (1998). Generating accurate rule sets without global optimization. Proc. of the 15th International Conference on Machine Learning. Morgan Kaufmann, 144-151.
Goldstein, N., Gray, K. (1999). Biosolids composting in the United States. 1999 BioCycle 40 (1), 63-75.
Gray, K. (1999). MSW and biosolids become feedstocks for ethanol. BioCycle, Vol. 40, No.8, 37-38.
Haug, R.T. (1993). The Practical Handbook of Compost Engineering. Boca Raton, FL, Lewis Publishers.
Holmes, G., Hall, M., and Frank, E. (1999). Generating rule sets from model trees. Proc 12th Australian
Joint Conference on Artificial Intelligence, Sydney, Australia, Springer, 1-12.
Hong, J. W. (1996). RBF Java Applet. MIT. Available from
http://diwww.epfl.ch/mantra/tutorial/english/rbf/html/index.html . Internet; accessed June 2006.
Hornik, K., Stinchcombe, M., White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, Vol.2, 359-366.
Isozaki, H., Kazawa, H. (2002). Efficient Support Vector Classifiers for Named Entity Recognition.
Proceedings of the 19th International Conference on Computational Linguistics (COLING'02),
Taipei, Taiwan, 390-396.
Jang, J.S.R. (1993). ANFIS: Adaptive-network-based fuzzy inference system. IEEE Transactions on Systems, Man and Cybernetics, Vol.23, No.3, 665-685.
Liang, C., Das, K. C., McClendon, R. W. (2003a). Prediction of Microbial Activity during biosolids
composting using Artificial Neural Networks. American Society of Agricultural Engineers, 2003,
ISSN 0001-2351, Vol. 46(6): 1713-1719.
Liang, C., Das, K. C., McClendon, R. W. (2003b). The influence of temperature and moisture contents
regimes on the aerobic microbial activity of a biosolids composting blend. Bioresource Technology
86, 2003, 131-137, Elsevier Science Ltd.
McCartney, D., Tingley, J. (1998). Development of a rapid moisture content method for compost
materials. Compost Science and Utilization, 6(3), 14-25.
Miller, F.C. (1992). Composting as a process based on the control of ecologically selective factors. Soil
Microbial Ecology: Applications in Agriculture Environment Management, 515-544, F.Blaine-
Metting, ed. N.Y.:Marcel Dekker.
Mitchell, T. (1999). Machine Learning. The McGraw-Hill Companies.
Morris, E. (2005). FISSION: An Evolutionary Method for Fuzzy Learning. M.S. Thesis, Computer
Science, University of Georgia, Athens.
Nakasaki, K. and Ohtaki, A. (2002). A simple numerical model for predicting organic matter decomposition in a fed-batch composting operation. J. Environ. Qual., 31, 997-1003.
ORWARE (2001). ORWARE – A simulation tool for waste management. Eriksson, O., Frostell, B. et
al., Royal Institute of Technology at Stockholm.
Platt, J. (1998). Fast Training of Support Vector Machines using Sequential Minimal Optimization.
Advances in Kernel Methods - Support Vector Learning, B. Schoelkopf, C. Burges, and A. Smola,
eds., MIT Press.
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning, Vol.1, 81-106.
Quinlan, J.R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Roehr, M. (2001). The Biotechnology of Ethanol: Classical and Future Applications. Wiley-VCH
Rosso, L., Lobry, J.R., Flandrois, J.P. (1993). An unexpected correlation between cardinal temperatures of microbial growth highlighted by a new model. J. Theor. Biol., 162, 447-463.
Smola, A.J., Schölkopf, B. (1998). A Tutorial on Support Vector Regression. NeuroCOLT2 Technical Report Series, NC2-TR-1998-030.
Stombaugh, D.P., Nokes, S.E. (1996). Development of a biologically based aerobic composting
simulation model. Trans. ASAE 39, 239-250.
Vapnik, V. (1999). The nature of statistical learning theory. 2nd Ed., Springer-Verlag, NY.
WEKA (2006). Waikato Environment for Knowledge Analysis. University of Waikato, NZ. Available from http://www.cs.waikato.ac.nz/ml/weka/.
Witten, I., Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques. 2nd Ed., Elsevier Inc.
Xi, B., Wei, Z., Liu, H. (2005). Dynamic Simulation for Domestic Solid Waste Composting Processes.
The Journal of American Science, 1(1).
Zhou, Y. and Brodley, C. E. (1999). A hybrid lazy-eager approach to reducing the computation and
memory requirements of local parametric learning algorithms. The 16th International Conference on
Machine Learning, June 27-30, 503-512.