Master research Internship Internship report Numerical ... … · In this internship report, we...



Master research Internship

Internship report

Numerical rule mining for prediction of wheat and vine diseases

Domain: Machine Learning - Data Structures and Algorithms


Olivier Pelgrin


Luis Galárraga

Alexandre Termier



Abstract: Wheat and vine are two very important types of crops in France. Unfortunately, both suffer from various diseases, especially vine. The fight against crop diseases relies on the use of sometimes dozens of phytosanitary products, which raises concerns for public health and the environment. In order to better adapt the use of these products, statistical models have been developed to better understand the disease dynamics and reduce the use of phytosanitary products. In this internship report, we present a data mining method capable of extracting symbolic and numeric rules, called Hybrid Rules, for the study of wheat and vine diseases. These rules are interpretable and can be used to make predictions, like any other regression method.


1 Introduction

2 Bibliographic Study
2.1 Preliminaries
2.1.1 Attributes : Symbolic and Numeric
2.1.2 Pattern Mining : Frequent Itemsets and Closed Itemsets
2.2 Search in the Symbolic Space
2.2.1 Mining Association Rules
2.2.2 Mining Association Rules under Incomplete Evidences
2.2.3 Improving Pattern Mining Output
2.2.4 Adding new measures for pattern mining
2.3 Search in the Numerical Space
2.3.1 Kernel composition
2.3.2 Applying Regression Methods to Agricultural Surveys
2.4 Mixed approaches
2.4.1 First Order Regression
2.4.2 Terminological Regression Trees
2.4.3 Pattern Aided Regression
2.4.4 Finding Exceptional Subgroups in Data

3 Agronomic context : wheat and vine diseases

4 Mining Hybrid Rules
4.1 Existing Interpretable Models and their Problems
4.2 Method Overview
4.3 Extracting Rules Conditions
4.4 Dealing with Numeric Values
4.5 Regression models for the rules
4.5.1 Regression model choice
4.5.2 The Overfitting Problem
4.5.3 Sparse Regression Models
4.6 Rule Selection
4.7 Algorithm Description

5 Experimental evaluation
5.1 Datasets and competitors
5.2 Accuracy analysis
5.3 Impact of the number of rules on the accuracy
5.4 Impact of the discretization on the accuracy
5.5 Running times analysis
5.6 Noise sensitivity

6 Result discussion

7 Conclusion and future works


1 Introduction

The spread of digital technologies across many aspects of society has led to a sharp rise in available data over the past decades. For example, retail companies record all the sales they make. For big companies, this results in huge databases, with millions and sometimes billions of records. Extracting knowledge from such quantities of data requires data science techniques, since analyzing all these data by hand would be impossible. Because of the heterogeneity of the available data and of its potential uses, developing techniques to process large data has been one of the most active research fields in computer science. The domains of computer science dealing with this problem are called data mining and machine learning. Data mining methods use a large range of techniques from different fields, such as statistical analysis, machine learning, and many more. Machine learning is a field of computer science that aims at learning models and making predictions on data.

In this internship, we analyze data collected from agricultural surveys in the context of the Bulletin de Santé du Végétal (BSV). Since 2009, the BSV has contributed to delivering information on disease dynamics at the regional scale in order to help farmers make decisions about, among other things, the usage of phytosanitary products. Large volumes of data have been collected, yet few attempts have been made to exploit them, with the exception of several statistical studies [1].

During the internship, we aim at applying data mining techniques to disease monitoring data in order to predict the incidence of plant diseases. More specifically, we use data about wheat plantations and data about vineyards.

We will explore rule mining techniques that combine both numerical and symbolic data. The resulting rules are called numerical rules; we call them hybrid rules in our work. As an example, a rule applied to vineyards and representing the number of leaves affected by a disease could be:

IF (phenological stage = "Closing of the grape") ∧ (wind the previous month > 50m.s)

THEN Mildew intensity on leaves = 6.4 + 0.16 × (days above 11°C)

We can interpret this rule as "if the vine is at the phenological stage of grape closing, and the previous month was windy, then the mildew intensity is likely to be correlated with the number of days with a minimal temperature above 11°C".
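To make the structure of such a rule concrete, the following minimal Python sketch represents a hybrid rule as a symbolic condition paired with a linear model. The class name `HybridRule`, its fields and the record keys are hypothetical and chosen only for illustration; the coefficients are those of the example rule, and the "Flowering" stage is a made-up value.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Record = Dict[str, object]


@dataclass
class HybridRule:
    """A symbolic/numeric condition plus a linear model (illustrative only)."""
    condition: Callable[[Record], bool]   # the rule antecedent
    intercept: float                      # constant term of the linear model
    coefficients: Dict[str, float]        # numeric attribute -> coefficient

    def predict(self, record: Record) -> Optional[float]:
        # The rule only makes a prediction for records matching its condition.
        if not self.condition(record):
            return None
        return self.intercept + sum(
            coef * float(record[name])
            for name, coef in self.coefficients.items()
        )


# Hypothetical encoding of the example rule on mildew intensity.
mildew_rule = HybridRule(
    condition=lambda r: r["phenological_stage"] == "Closing of the grape"
    and r["wind_previous_month"] > 50,
    intercept=6.4,
    coefficients={"days_above_11C": 0.16},
)

matching_record = {
    "phenological_stage": "Closing of the grape",
    "wind_previous_month": 62.0,
    "days_above_11C": 10,
}
other_record = {
    "phenological_stage": "Flowering",  # made-up stage, does not match
    "wind_previous_month": 62.0,
    "days_above_11C": 10,
}

print(mildew_rule.predict(matching_record))  # 6.4 + 0.16 * 10
print(mildew_rule.predict(other_record))     # None: the rule does not apply
```

Returning `None` when the condition fails is one possible design choice: it makes explicit that a rule covers only part of the data, so a complete predictor needs several rules or a default model.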

Numerical data has traditionally been modeled with regression models, a sub-field of machine learning. However, regression models are unable to take symbolic data into account. Because numerical rules combine both numerical and symbolic data, numerical rule mining can be seen as an extension of regression models to deal with symbolic data.

The rest of this internship report is organized as follows. The first part is a bibliographical study of existing methods related to our problem. In it, we discuss methods that deal with symbolic values, then methods for numerical values, and finally existing hybrid methods capable of using both kinds of data.

The next sections describe the work done during the internship. We begin with some context on crop diseases and the monitoring data used during this internship. Then we describe our proposed method in three main parts. First we explain how to mine the rules' antecedents, which we call the rule conditions. Then we focus on the regression models; in particular, we are interested in sparse linear models. Finally, we study the performance of our method w.r.t. state-of-the-art competitors in the experimental evaluation section.



2 Bibliographic Study

2.1 Preliminaries

2.1.1 Attributes : Symbolic and Numeric

In our context, an attribute is a data field representing a characteristic of a data object. There are mainly two types of attributes that are of interest for us: symbolic attributes and numeric attributes.

A symbolic attribute, also called categorical attribute, contains symbols or names. The values of a symbolic attribute do not have an order and can represent things like categories, states, codes, etc. Even if the values of a symbolic attribute are numbers, mathematical operations on symbolic attributes are not meaningful.

A numeric attribute can contain integer or real values. A numeric attribute usually represents a measurable quantity (e.g. a temperature, a price, etc.).

2.1.2 Pattern Mining : Frequent Itemsets and Closed Itemsets

Pattern mining is a sub-domain of data mining. A pattern mining task consists in efficiently finding regularities in data.

To perform a pattern mining task, we first need to define what a pattern (also called itemset) is. Let $A = \{i_1, \dots, i_n\}$ be a set of items. An item can be any symbolic piece of data. Any $X \subseteq A$ is an itemset. A transaction $t$ is a set of items such that $t \subseteq A$. A database can be expressed as a set of transactions $\mathcal{T} = \{t_1, \dots, t_m\}$.

The most common definition of a pattern is the frequent itemset. We denote by $t(X) = \{t \in \mathcal{T} \mid X \subseteq t\}$ the set of transactions that contain the itemset $X$. The support of an itemset is $|t(X)|$. An itemset is frequent if its support is greater than or equal to a specified threshold. This threshold is often expressed as a percentage of the database size. The first work proposing an algorithm to mine frequent itemsets is Apriori [2]. The main issue with mining frequent itemsets is that on large databases, the set of frequent itemsets is very large and often contains redundant patterns, which makes frequent itemsets hard to use in practice.

To deal with this problem, the concept of closed itemset was introduced. The set of frequent closed itemsets is a subset of the set of frequent itemsets. A closed itemset can be defined as follows: a pattern $P \subseteq A$ is a closed pattern if there exists no pattern $Q \subseteq A$ such that $P \subset Q$ and $t(P) = t(Q)$. A closed itemset is the largest itemset of an equivalence class. An equivalence class is a set of patterns $EC = \{X_1, \dots, X_n\}$ such that $\forall X_i \in EC$, $t(X_i) = T$ with $T \subseteq \mathcal{T}$, that is, they all represent the exact same set of transactions.
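The definitions of support, frequent itemset and closed itemset can be illustrated with a small brute-force sketch in Python. The toy database and the function names are assumptions made for this example; the exhaustive enumeration is exponential in the number of items and is not how Apriori or any practical miner works.

```python
from itertools import combinations

# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "d"},
    {"c", "d"},
]


def support(itemset, db):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def frequent_itemsets(db, min_support):
    """Enumerate all non-empty itemsets and keep the frequent ones."""
    items = sorted(set().union(*db))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(candidate, db)
            if s >= min_support:
                result[candidate] = s
    return result


def closed_itemsets(frequent):
    """Keep itemsets that have no strict superset with the same support.

    A superset can only lose transactions, so equal support implies an
    identical set of covering transactions.
    """
    return {
        x: s
        for x, s in frequent.items()
        if not any(x < y and s == sy for y, sy in frequent.items())
    }


frequent = frequent_itemsets(transactions, min_support=2)
closed = closed_itemsets(frequent)
for itemset, s in sorted(closed.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), s)
```

On this toy database, {a} and {b} are frequent but not closed, because {a, b} covers exactly the same three transactions; only {a, b}, {c} and {d} remain as closed itemsets.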

2.2 Search in the Symbolic Space

The task of finding interesting patterns in a dataset has been widely studied in the field of data mining. In particular, the domain of pattern mining has provided many contributions to the exploration of symbolic spaces. In this section, we will study several techniques from the pattern mining domain that search for different types of patterns such as association rules. We will discuss techniques to improve the "interestingness" of the output patterns, as well as new metrics to measure the quality of a pattern.



2.2.1 Mining Association Rules

Association rules definition

Besides frequent and closed itemsets, another very common type of pattern is the association rule. An association rule is an expression of the form $A \xrightarrow{s,c} B$ where $A$ and $B$ are itemsets. While the support is enough to measure the quality of a frequent or a closed itemset, it is insufficient for association rules. The support is still a meaningful measure, but we also introduce a new measure called the confidence.

The support of an association rule is:

$$s = |t(A \cup B)|$$

This represents the number of transactions containing both $A$ and $B$.

The confidence of a rule is:

$$c = \frac{|t(A \cup B)|}{|t(A)|}$$

This corresponds to the probability that a transaction that contains $A$ also contains $B$.
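As a quick illustration, the sketch below computes both measures for a single rule, taking the support as the number of transactions containing both sides of the rule and the confidence as that number divided by the number of transactions containing the antecedent. The shopping-basket database and the function names are hypothetical.

```python
def cover(itemset, db):
    """Transactions of `db` that contain every item of `itemset`."""
    return [t for t in db if itemset <= t]


def rule_support(antecedent, consequent, db):
    """Number of transactions containing both sides of the rule."""
    return len(cover(antecedent | consequent, db))


def rule_confidence(antecedent, consequent, db):
    """Fraction of the transactions containing the antecedent that also
    contain the consequent (0.0 if the antecedent never occurs)."""
    n_antecedent = len(cover(antecedent, db))
    if n_antecedent == 0:
        return 0.0
    return rule_support(antecedent, consequent, db) / n_antecedent


# Hypothetical shopping baskets.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
]

bread, butter = frozenset({"bread"}), frozenset({"butter"})
print(rule_support(bread, butter, baskets))     # baskets with bread and butter
print(rule_confidence(bread, butter, baskets))  # share of bread baskets with butter
```

Here the rule bread → butter holds in 2 baskets out of the 3 that contain bread, so its support is 2 and its confidence is 2/3.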

Mining non-redundant association rules

Mining association rules consists of finding all frequent itemsets, and then generating rules with high confidence (above a user-specified threshold). Because association rules are generated from frequent itemsets, they inherit the problems of frequent itemsets: they can be extremely numerous and redundant.

To solve this issue, M. Zaki [3] proposed a new method to mine association rules, based on closed itemsets. In this work he shows several properties that can be used to eliminate redundant rules. Instead of generating rules from frequent itemsets, rules can be generated from closed itemsets.

Let $A$, $B$ be frequent itemsets, and $cit(A)$, $cit(B)$ their respective closures. He shows that the rule $A \xrightarrow{s,c} B$ is equivalent to the rule $cit(A) \xrightarrow{s,c} cit(B)$. Thus we only need to consider closed itemsets to construct rules. He also uses the concept of minimal generators in order to make the rules more readable. A generator of an itemset $A$ is an itemset $A' \subseteq A$ such that $A'$ and $A$ have the same support. Let $G(A)$ be the set of generators of $A$; $A' \in G(A)$ is a minimal generator if it has no proper subset in $G(A)$. Minimal generators are useful as they can concisely represent all itemsets of the same equivalence class (itemsets having the same closure).

[3] proposes an algorithm to efficiently mine non-redundant association rules based on the minimal generators of closed itemsets. The algorithm is capable of efficiently finding all non-redundant association rules between closed itemsets $A$ and $B$, with $A \subseteq B$.
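The notions of closure and minimal generator can be made concrete with a brute-force Python sketch. It is not the algorithm of [3]: the function names and the toy database are assumptions for illustration, and the enumeration of subsets is only practical for very small itemsets.

```python
from itertools import combinations

# Hypothetical toy database.
db = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "d"},
    {"c", "d"},
]


def cover(itemset, db):
    """Indices of the transactions that contain `itemset`."""
    return frozenset(i for i, t in enumerate(db) if itemset <= t)


def closure(itemset, db):
    """Largest itemset shared by all transactions containing `itemset`."""
    covering = [frozenset(t) for t in db if itemset <= t]
    if not covering:
        # Convention used here: an itemset that never occurs closes to all items.
        return frozenset().union(*db)
    return frozenset.intersection(*covering)


def minimal_generators(closed, db):
    """Smallest subsets of `closed` covering the same transactions as it."""
    target = cover(closed, db)
    generators = []
    # Subsets are visited by increasing size, so any generator that contains
    # a previously found one is not minimal and is skipped.
    for size in range(len(closed) + 1):
        for combo in combinations(sorted(closed), size):
            candidate = frozenset(combo)
            if cover(candidate, db) == target and not any(
                g <= candidate for g in generators
            ):
                generators.append(candidate)
    return generators


print(sorted(closure(frozenset({"a"}), db)))
print([sorted(g) for g in minimal_generators(frozenset({"a", "b"}), db)])
```

In this database, a and b always occur together, so the closure of {a} is {a, b}, and the closed itemset {a, b} has two minimal generators, {a} and {b}.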

Link to our problem

The presented work on mining non-redundant association rules provides an efficient framework to find rules in the symbolic space. In the context of our problem, however, we want to mine numerical rules. The lack of support for numerical values in association rule mining algorithms makes them unsuitable to solve our problem. However, some of their ideas, especially the use of minimal generators to generate rules, may be of use for us to explore the symbolic part of our data.

2.2.2 Mining Association Rules under Incomplete Evidences

We previously presented association rules in the context of classic databases. Association rules are interesting because they are easily understandable, so we may want to apply association rule mining to other kinds of databases.

Association rule mining in the context of Knowledge Bases

Association rules can also be applied to Knowledge Bases (KBs). KBs do not present data in a static tabular way like classical databases. Instead, KBs model data as binary relations between entities. An intuitive way to represent a KB is a graph. Another particularity of KBs is that they make an Open World Assumption (OWA), which means that any statement not in the KB is unknown. Most databases operate under the Closed World Assumption (CWA). This means that if a statement is not present in the database, we can assume that it is false, because the database is assumed to be complete.

Under the OWA, data not present in the KB may or may not be true. So in order to mine association rules from such KBs, we need algorithms capable of dealing with incomplete data.

Mining rules in knowledge bases

Galárraga et al. proposed an algorithm capable of mining association rules under incomplete data in the context of ontological knowledge bases [4].

The proposed algorithm, AMIE, mines Horn rules from RDF KBs. Horn rules are of the form $\vec{B} \rightarrow r(x, y)$, where $r(x, y)$ is an atom, and $\vec{B}$ a set of atoms of the form $B_1 \wedge B_2 \wedge \dots \wedge B_n$.

In order to deal with the OWA, new measures are needed to express the quality of a rule. In their work they make the Partial Completeness Assumption (PCA), which means that if the KB knows some facts of relation $r$ for $x$, then it knows all of them. This might not be true for all KBs, but most of the time it is a reasonable assumption. They also want to avoid redundant rules, so they introduce the concept of closed rule. In this context, a rule is closed if each variable present in the rule appears at least twice.

For example, let us consider the rule $bornInCity(x, y) \wedge isCapital(y, z) \rightarrow isCitizenOf(x, z)$. This rule is closed as $x$, $y$ and $z$ all appear at least two times. If we add the atom $favoriteColor(x, r)$, the rule becomes $bornInCity(x, y) \wedge isCapital(y, z) \wedge favoriteColor(x, r) \rightarrow isCitizenOf(x, z)$. Now the rule is not closed because $r$ appears only once. We can see that the new atom does not add interesting information and will lower the support of the rule, as it will now only match people that have a favorite color $r$.
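The closedness test on this example can be sketched in a few lines of Python. The encoding of atoms as `(relation, argument, argument)` tuples and the function name `is_closed` are assumptions made here; as a simplification, every argument is treated as a variable, so constants are not handled.

```python
from collections import Counter


def is_closed(body, head):
    """A rule is closed if every variable occurs at least twice in it.

    Atoms are (relation, arg1, arg2) tuples; all arguments are treated
    as variables in this simplified sketch.
    """
    counts = Counter(var for _, *args in body + [head] for var in args)
    return all(count >= 2 for count in counts.values())


head = ("isCitizenOf", "x", "z")
body = [("bornInCity", "x", "y"), ("isCapital", "y", "z")]

print(is_closed(body, head))                                  # closed rule
print(is_closed(body + [("favoriteColor", "x", "r")], head))  # r occurs once
```

The first call reports a closed rule, while adding the atom favoriteColor(x, r) introduces the variable r, which occurs only once, so the second rule is not closed.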

Their algorithm works in the following manner: at each step, the algorithm outputs the closed rules it has found, unless they are pruned according to several measures, then it uses operators to refine the rules for the next step.

Their algorithm is able to mine Horn rules in RDF knowledge bases in an efficient manner, and is able to deal with incomplete data.

Link to our problem

As with association rule mining in classic databases, this approach lacks the ability to deal with numerical values. The most important part of this work for our problem is the ability to work with incomplete data, since we already know that we will need to deal with missing data.

2.2.3 Improving Pattern Mining Output

One of the main issues of pattern mining algorithms is their tendency to output a lot of patterns. It is common to have millions of patterns in the output. The use of the closure property to mine closed itemsets can alleviate this problem, sometimes by several orders of magnitude. However, even with closed itemset mining, the output can be very large.

To reduce the size of the output, one may raise the minimum support threshold, which necessarily leads to a smaller output. This method is not without problems, as it is often difficult to choose a suitable minimum support. Moreover, for some applications, raising the support means that we will mainly find very general patterns, which may not be interesting.

Mining the top-k itemsets of length min-l

To solve this problem, Wang et al. propose to change the data mining task [5]. Their main idea is to mine the top-k frequent itemsets of a minimum size. Both k and min-l are user-specified inputs: k is the number of itemsets to return, and min-l defines the minimum size of an itemset.

To solve this task, they propose a new algorithm, based on the FP-Close algorithm [6]. FP-Close is a frequent closed itemset mining algorithm that uses a data structure called the prefix tree to represent itemsets [6]. Figure 1b illustrates the concept of prefix tree.

Figure 1: Dataset and prefix tree. (a) Transaction database. (b) Prefix tree.

The prefix tree concisely represents the patterns of a dataset. Each node in the tree corresponds to an item and its support. Each path from a node to one of the nodes below corresponds to a pattern whose support is contained in the lowest node. For example, in figure 1a, we can see that the pattern $\{a, b, c, e, h\}$ occurs only once. In figure 1b, this pattern is represented by the path from a to h, and the support of the pattern is the support of the node h, which is 1.
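A minimal prefix tree can be sketched in Python as follows. The class and function names are assumptions, the toy database is hypothetical and is not the one of Figure 1, and items are inserted in alphabetical order for simplicity, whereas FP-tree based miners typically order items by decreasing frequency.

```python
class Node:
    """A prefix-tree node: an item, a count, and children indexed by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def build_prefix_tree(db):
    """Insert each transaction as a path of items in a fixed order."""
    root = Node()
    for transaction in db:
        node = root
        for item in sorted(transaction):
            node = node.children.setdefault(item, Node(item))
            node.count += 1  # one more transaction shares this prefix
    return root


def path_count(root, path):
    """Count stored in the last node of a path starting at the root."""
    node = root
    for item in path:
        node = node.children.get(item)
        if node is None:
            return 0
    return node.count


# Hypothetical toy database.
db = [
    {"a", "b", "c", "e", "h"},
    {"a", "b", "d"},
    {"a", "b", "d"},
    {"a", "c"},
]

root = build_prefix_tree(db)
print(path_count(root, ["a", "b"]))                 # transactions starting with a, b
print(path_count(root, ["a", "b", "c", "e", "h"]))  # the full first transaction
```

Transactions that share a prefix share the corresponding nodes, which is what makes the structure compact: the three transactions starting with a, b are stored along a single path whose node b has count 3.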

In order to mine the top-k patterns of size min-l, they propose an algorithm adapted from FP-Close. To be efficient, the algorithm automatically raises the minimum support during the construction of the prefix tree.

There are two ways to automatically raise the minimum support. The first one is to raise the minimum support according to the number of closed nodes in the prefix tree. A closed node is a node whose support is higher than the sum of the supports of its children. In figure 1b these are the nodes with a dashed line. The algorithm only takes closed nodes into consideration. For example, if we want to mine the top-k itemsets, the minimum support can be raised to the number of closed nodes with a support higher than k, if this number is no less than k.

The other method to raise the minimum support is the descendant sum. It consists of adding up the supports of the nodes below an anchor node. An anchor node is a node above the level min-l − 1 (min-l is the minimum desired length of an itemset). In figure 1b, this level is illustrated by the L-watermark, so the anchor node in this example is a. For the descendant sum, we add up the supports of all distinct nodes below the anchor node. For example, in figure 1b, the node d appears in both the right branch and the left branch of a's sub-trees. Its support is then 3 + 1 = 4. Once we have the list of the nodes ordered by their support, we can raise the minimum support to the value of the kth most frequent item.

Link to our problem

As with other pattern mining algorithms, the lack of support for numerical values makes this approach unable to solve our problem. However, their methods to dynamically raise the minimum support may be useful for us.

2.2.4 Adding new measures for pattern mining

Until now we were mostly interested in the frequency of occurrence of an itemset in a dataset. But we might want other measures to evaluate an itemset, and not just its support. Li et al. proposed to mine itemsets with respect to two new measures, the Odds Ratio (OR) and the Relative Risk (RR) [7]. An itemset is still evaluated w.r.t. its support, but also its OR or its RR. OR and RR are commonly used measures in statistical studies to evaluate the relation between an effect and a possible cause. In a disease study of groups of persons, OR and RR would be defined as:

$$OR = \frac{D_E / H_E}{D_N / H_N} \qquad RR = \frac{D_E / (D_E + H_E)}{D_N / (D_N + H_N)}$$

where $D_E$ is the number of diseased persons exposed to a factor, $D_N$ the number of diseased persons not exposed to the factor, $H_E$ the number of healthy persons exposed to the factor, and $H_N$ the number of healthy persons not exposed to the factor.
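As a small numerical illustration, the sketch below computes both measures from the four counts, using the standard epidemiological definitions: the odds ratio compares the odds of disease among exposed and non-exposed persons, and the relative risk compares the proportions of diseased persons in the two groups. The function names and the counts are made up for this example, and zero counts are not handled.

```python
def odds_ratio(de, dn, he, hn):
    """Odds of disease among exposed persons over odds among non-exposed."""
    return (de / he) / (dn / hn)


def relative_risk(de, dn, he, hn):
    """Disease proportion among exposed persons over that of non-exposed."""
    return (de / (de + he)) / (dn / (dn + hn))


# Made-up counts: 100 exposed persons (30 diseased, 70 healthy)
# and 100 non-exposed persons (10 diseased, 90 healthy).
de, he, dn, hn = 30, 70, 10, 90

print(round(odds_ratio(de, dn, he, hn), 3))
print(round(relative_risk(de, dn, he, hn), 3))
```

With these counts the risk of disease is 30% among exposed persons and 10% among non-exposed ones, so the relative risk is 3, while the odds ratio is 27/7, about 3.86. The two measures get closer to each other as the disease becomes rarer.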

Mining Discriminant Itemsets

This work is interesting because it leads to a new kind of patterns, discriminative patterns. In discriminative pattern mining, we are no longer interested in the frequency of a pattern; instead, we want to find patterns that are representative of a target class.

In order to mine itemsets w.r.t. OR and RR, the dataset must have a target class. This target class may indicate, for each transaction, the presence of a certain phenomenon (a disease for example). In this context, the class has only two possible values: positive or negative.

The main idea is to identify all the equivalence classes of the data. Recall that an equivalence class is the set of all itemsets appearing in the same transactions. An equivalence class can be concisely represented by its borders: its generators and its closed itemset. Then, for each equivalence class, we compute the OR (RR), the support among transactions of the positive class, and the support among transactions of the negative class. If the computed OR (RR) is above a fixed threshold, then this equivalence class, its OR (RR), its positive support and its negative support can be output.

The challenge is to efficiently find the equivalence classes. Li et al. propose two different methods in their paper: mine the closed itemsets and the generators separately, or mine both in the same algorithm. The latter is the most efficient way. They propose two algorithms, one for each approach. Both algorithms are based on the idea of the prefix tree.

For the first approach, they use FP-Close to find the closed itemsets, and they propose a new algorithm, GR-Growth, to find the generators.

For the second approach, they propose a new algorithm, GC-Growth, capable of finding both the closed itemsets and the generators.

Link to our problem

In our context, we will study plant diseases. We might want to find symbolic values that are representative of a specific kind of disease.

2.3 Search in the Numerical Space

Analyzing numerical attributes often means learning how the numerical values "behave" in the data. Regression is the field of machine learning that aims at explaining the values of a target numerical attribute as a function of the values of other numerical attributes. Regression methods learn models to represent a numerical attribute. Such models can be used to describe the behavior of the numerical attribute and also to predict values for new data. In this section we will discuss techniques based on kernel regression, and how regression techniques can be applied to agricultural studies.

2.3.1 Kernel composition

There exist two types of regression, parametric and non-parametric regression. Parametric regression consists of learning from data the parameters of a fixed model. New values can then be predicted by using the model with its learned parameters.

In non-parametric regression, there is no fixed model. Instead the model is derived from the training data. There are several different non-parametric regression algorithms. The main idea behind most of them is to express values as a weighted average of the neighbors' values. Often, a kernel is used as a weighting function.

The choice of the kernel function plays a crucial role in the accuracy of kernel-based non-parametric regression techniques. We will study a work by Duvenaud et al. [8] that automatically chooses a kernel for non-parametric kernel-based methods.

Automatic kernel composition

It is possible to use a fixed kernel for non-parametric regression, but it will often not be adapted to the problem. Instead we may want to adapt the kernel to the learning problem.

The work of Duvenaud et al. provides an automated way to choose a kernel for non-parametric regression. They formulate the choice of kernel as a search in a space of kernel structures defined by sums and products of a small number of base kernels. The idea is to automatically combine kernels by using the sum and the product operators to create more complex kernels. The base kernels used for composition are simple one-dimensional kernels. There are two operators to compose kernels: the sum, which can express the superposition of different processes, and the product, which can express interactions between different input dimensions.

The proposed algorithm uses three search operators to navigate the space of kernel structures: any subexpression $S$, i.e. a kernel, can be replaced by $S + B$ with $B$ a base kernel, any subexpression $S$ can be replaced by $S \times B$, and any base kernel can be replaced by another base kernel.

The algorithm operates in a greedy way. At each stage it chooses the best-scoring kernel w.r.t. a quality measure. Then it expands this kernel by applying all possible operators. The algorithm continues until the kernel performance no longer improves, or the maximum number of steps has been reached.
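The expansion step of this search can be sketched symbolically in Python. The sketch only enumerates the kernel structures reachable with one application of the three operators; it does not fit or score any kernel. The encoding of expressions as nested tuples, the function name `expand` and the base kernel labels are assumptions made for illustration.

```python
# Placeholder labels for base kernels (assumed names, for illustration only).
BASE_KERNELS = ("SE", "PER", "LIN")


def expand(expr):
    """All kernel structures reachable from `expr` with one search operator.

    An expression is either a base kernel label (a string) or a tuple
    (op, left, right) with op in {"+", "*"}.
    """
    results = set()
    # Operators 1 and 2: replace this subexpression S by S + B or S * B.
    for base in BASE_KERNELS:
        results.add(("+", expr, base))
        results.add(("*", expr, base))
    if isinstance(expr, str):
        # Operator 3: replace a base kernel by another base kernel.
        results.update(b for b in BASE_KERNELS if b != expr)
    else:
        # Apply the operators recursively inside each subexpression.
        op, left, right = expr
        results.update((op, new_left, right) for new_left in expand(left))
        results.update((op, left, new_right) for new_right in expand(right))
    return results


candidates = expand("SE")
print(len(candidates))
print(("+", "SE", "PER") in candidates)
```

Starting from a single base kernel and three base kernels in total, one step yields 8 candidates: 3 sums, 3 products and 2 replacements. A greedy search would score each candidate on the data, keep the best one, and call `expand` again on it.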

They show in their experiments that this method can find good kernels to model many different processes, performing comparably to kernels manually tuned by experts.

Link to our problem

The method of exploring a space of kernel functions by using operators is really interesting for us. We could use this idea to explore our numerical space to find the best model to fit the numerical values.

2.3.2 Applying Regression Methods to Agricultural Surveys

In many countries, agricultural surveys are done annually to detect the presence of diseases and pests. These surveys can help farmers make decisions about the use of treatments in their plantations.

Thanks to these surveys, there is a lot of available data concerning the incidence of diseases over several years in many different sites. Despite the availability of such data, few attempts have been made to model the dynamics of pests and diseases on crops.

Generalized Linear Mixed Models to model crop pests and diseases

During her thesis, L. Michel [1] developed a framework based on Generalized Linear Mixed Models (GLMM) to make better use of these agricultural surveys. GLMMs can be viewed as an extension of generalized linear models. In this thesis, she studies data containing information about pests or crop diseases on several site-years. The GLMM used in her framework is expressed as:

$$g(\mu_{ij}) = \alpha_i + \beta_i t_{ij} + \sum_{k=1}^{K} \gamma_k Z_i^{(k)}$$

where $g$ is the link function, $i$ is the index of the observed site-year, $j$ is the index of the observation, $t_{ij}$ the time of the observation, $\alpha_i$ and $\beta_i$ random variables that model variance and missing values in the measures, $Z_i^{(k)}$ variables characterizing the $i$th site with their associated regression parameters $\gamma_k$, and $\mu_{ij}$ the expected value.

The proposed framework contains the following steps: first, the GLMM needs to be adapted to the considered data. Then the model is fitted to observations up to a certain time. Once fitted, the model can be used to study pest and disease dynamics and also to predict future values.

She evaluated her framework on several datasets and showed that it is suitable to model disease and pest dynamics on many different kinds of crops.

Application to our problem

This work shows that regression methods can be used to model the dynamics of diseases and pests on different types of crops. It is of interest for us because we will work on very similar data.



2.4 Mixed approaches

In order to deal with data that contains both numerical and symbolic values, techniques have been developed to take both types of values into account. Several different approaches have been developed in this context, some adapting well-known machine learning algorithms and others using new data mining approaches with numerical attributes. We will study these different techniques in this section.

2.4.1 First Order Regression

One of the first works towards the exploration of both numerical values and symbolic values is first-order regression by A. Karalic and I. Bratko [9].

Their work proposes to combine Inductive Logic Programming (ILP) with numerical regression. ILP is a field of machine learning which uses first-order logic statements to represent examples, background knowledge and hypotheses. A first-order logic statement in the ILP context can be of the form $\vec{B} \wedge \vec{A} \rightarrow r(x, y)$ with $r(x, y)$ a predicate, $\vec{A}$ a set of predicates of the form $A_1 \wedge \dots \wedge A_n$, and $\vec{B}$ the background knowledge, also of the form $B_1 \wedge \dots \wedge B_n$.

At the time of publication, most ILP systems, just like data mining algorithms, were unable to deal with numerical values.

The �rst-order regression learning system

In First-Order Regression (FORS), they extends there use of ILP for the target predicate to benumerical attribute that is modeled via linear regression. FORS induce clauses in the form

f(Y, V1, ..., Vn) :- literal1, ..., literaln, !.

Here V1, ..., Vn are variables and Y is the target variable, also called the class variable. A literal can be a background knowledge predicate; a variable-value literal, of the form Variable = value for discrete variables and of the form Variable ≤ const or Variable ≥ const for real-valued variables; a recursive call to another clause; or a regression literal of the form Y is Exp, with Exp a number or a linear function of real-valued variables. Clauses can only cover positive examples. FORS constructs clauses that partition the search space into subspaces, each of which is modeled via regression.

The algorithm works as follows: it covers learning examples with clauses until there are no longer sufficiently many examples to continue, then it adds a default clause that always succeeds in order to cover the remaining examples. Clauses are constructed using a beam search. To specialise a clause for the next iteration, FORS first tries to extend it with background knowledge literals, then with variable-value literals and finally with recursive literals. When a clause is constructed, its redundant literals are removed.

Relation with our problem

While this method succeeds at combining numerical and categorical values, it has some limitations in our context. First, it can only use linear regression. Second, it requires background knowledge as input, whereas we want our system to work automatically, without preliminary knowledge about the data. Also, FORS is implemented in Prolog, which is not focused on scalability.



2.4.2 Terminological Regression Trees

Another way to tackle the problem of dealing with both numerical and symbolic values is to adapt existing machine learning algorithms. N. Fanizzi et al. used regression trees on ontological knowledge bases [10].

Regression trees are based on the concept of decision trees. A regression tree is a tree in which inner nodes contain tests on an attribute, whereas the leaves contain a regression model. To perform the prediction for a transaction, we go through the tree from the root to a leaf: at each node we perform the test w.r.t. the value of the attribute in the transaction and go either to the left or to the right sub-tree depending on the result of the test. When we reach a leaf, the value of that leaf is the predicted value for the transaction.
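As an illustration, the root-to-leaf traversal can be sketched in a few lines of Python. This is only a minimal sketch and not the implementation of [10]: the `Node` and `Leaf` classes, the threshold test `value <= threshold`, the constant leaf values and the attribute names are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: float  # prediction stored in the leaf (a constant in this sketch)


@dataclass
class Node:
    attribute: str               # attribute tested at this inner node
    threshold: float             # hypothetical test: transaction[attribute] <= threshold
    left: Union["Node", Leaf]    # sub-tree followed when the test succeeds
    right: Union["Node", Leaf]   # sub-tree followed when the test fails


def predict(tree, transaction):
    """Walk from the root to a leaf and return the value of that leaf."""
    node = tree
    while isinstance(node, Node):
        if transaction[node.attribute] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


# Toy tree with made-up attributes and values.
tree = Node("humidity", 70.0,
            left=Leaf(5.0),
            right=Node("temperature", 18.0, left=Leaf(12.0), right=Leaf(30.0)))

print(predict(tree, {"humidity": 80.0, "temperature": 25.0}))
```

Each root-to-leaf path of such a tree reads directly as a rule, for instance "humidity > 70 and temperature > 18" for the rightmost leaf of the toy tree.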

Terminological Regression Trees are an extension of regression trees. The main difference concerns the kind of test used in the nodes. Because Terminological Regression Trees have been designed for ontological knowledge bases, the node tests are expressed with Description Logic (DL) concepts. DL concepts can also express tests on several attributes, which makes the node tests more expressive than those of standard regression trees.

As with decision trees, it is easy to build rules from Terminological Regression Trees: each path from the root to a leaf defines a rule. This is useful because rules are much more readable than trees.

Relation with our problem

We have seen that we can easily build rules from Terminological Regression Trees. The main problem is that DL cannot handle numerical attributes, so neither can the node tests. This means that the antecedents of rules cannot contain conditions on numerical attributes.

2.4.3 Pattern Aided Regression

We have seen that pattern mining methods are really efficient at dealing with symbolic attributes, but are unable to deal with numerical ones. In this section we will discuss Pattern-Aided Regression [11], a recent work that combines pattern mining algorithms with regression methods.

The authors of this work observed that in many real-world datasets, different parts of the data behave differently w.r.t. a target attribute. Regression methods do not take this into account, and thus learn a single global model that fits the data as well as it can. This may lead to very complex models that are difficult to understand.

To tackle this problem, they introduce the concepts of Pattern-Aided Regression (PXR) and Contrast Pattern Aided Regression (CPXR).

Pattern-Aided Regression and Contrast Pattern Aided Regression

They first introduce the concept of PXR. A PXR model consists of several patterns (itemsets), each representing a subset of the data, and an associated local regression model, often a linear function. This captures the fact that some parts of the data may behave differently w.r.t. the target attribute.

CPXR is very similar to PXR, but instead of using standard patterns to describe the data, it uses contrast patterns, which is another name for discriminative patterns. The patterns used in this work can deal with numerical attributes to describe the data. This is done by discretizing the numerical attributes via entropy-based binning. The main idea of CPXR is to use discriminative patterns to describe parts of the data where the baseline regression model makes a large error.

The CPXR algorithm works as follows. It first divides the data into two subsets w.r.t. a baseline regression model, Large Error (LE) and Small Error (SE). Then CPXR mines the discriminative patterns for the LE subset, and learns a local regression model for each of them. It removes patterns that are similar to other patterns, and patterns that do not bring a significant error reduction w.r.t. the baseline model. Then CPXR iteratively improves the patterns in order to maximize the total error reduction of the pattern set representing LE. Total error reduction measures the total residual reduction achieved by the pattern set model relative to the baseline model. The learned models are then combined to form a new global model.

In the experiments reported by its authors [11], CPXR consistently outperforms most regression algorithms on several real-world and synthetic datasets.

Relation with our problem

CPXR is a very efficient framework capable of dealing with both numerical and symbolic attributes. Despite its effectiveness, the scope of this work is to reduce the error made by regression models, not to build interpretable numerical rules. It is still a very interesting work, as it demonstrates a way to use data mining methods together with regression methods.

2.4.4 Finding Exceptional Subgroups in Data

In the Pattern-Aided Regression paper [11], discriminative patterns are used to describe data that have an exceptional behaviour w.r.t. a target attribute. Subgroup Discovery (SD) is a field of data mining that is specifically interested in finding subgroups in data that have an exceptional behaviour w.r.t. the rest of the data. However, SD is limited to exceptional behaviour w.r.t. a single target attribute.

One may want to find subgroups of data that have an exceptional behaviour w.r.t. several attributes. W. Duivesteijn and A. J. Feelders proposed a more general framework dedicated to this task [12], Exceptional Model Mining (EMM).

The Exceptional Model Mining Framework

EMM is a framework capable of finding subgroups that exhibit an exceptional behaviour on several target variables. Several target attributes are selected, and a model is chosen on these target attributes to be the target concept. In EMM, there is no restriction on the types of attributes that can be used to learn the target concept. The user is free to define what "exceptional" means in their problem.

EMM uses a user-specified Description Language (DL) to describe the data. In [12], they propose a DL that consists of logical conjunctions of conditions on single attributes. Finally, EMM needs a quality measure to evaluate the quality of a subgroup based on the selected model. This choice is also left to the user.

The EMM algorithm follows a beam search approach to find exceptional subgroups: as the set of possible subgroups of the data can be extremely large, an exhaustive search is not viable. As in standard beam search, EMM refines the descriptions of the current level to generate the descriptions of the next level. To do so, EMM defines refinement operators on the DL. In the paper they propose three operators to deal with symbolic and numeric attributes. At each stage, EMM keeps the w best descriptions w.r.t. the chosen quality measure, and improves them using the refinement operators. EMM stops when it reaches the maximum allowed depth d. Then it outputs the q best descriptions found. w, d and q are all user-specified parameters.
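The role of the parameters w, d and q can be illustrated with a generic beam search. The sketch below is not the EMM algorithm of [12]: the `refine` and `quality` functions, the toy data and the "mean shift" quality measure are hypothetical, and the refinement operator only handles symbolic attributes.

```python
def beam_search(data, refine, quality, w, d, q):
    """Generic beam search over descriptions.

    refine(description)  -> iterable of more specific descriptions
    quality(description, data) -> float, higher is better
    w: beam width, d: maximum depth, q: number of descriptions returned
    """
    beam = [()]      # start from the empty description
    results = []     # (quality, description) for every description evaluated
    seen = set()
    for _ in range(d):
        candidates = []
        for desc in beam:
            for child in refine(desc):
                if child not in seen:
                    seen.add(child)
                    candidates.append((quality(child, data), child))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0], reverse=True)
        results.extend(candidates)
        beam = [desc for _, desc in candidates[:w]]  # keep the w best
    results.sort(key=lambda c: c[0], reverse=True)
    return results[:q]


def make_refine(data, attributes):
    """Hypothetical refinement operator: add one (attribute = value) condition."""
    values = {a: sorted({row[a] for row in data}) for a in attributes}

    def refine(desc):
        used = {a for a, _ in desc}
        for a in attributes:
            if a not in used:
                for v in values[a]:
                    yield tuple(sorted(desc + ((a, v),)))
    return refine


def mean_shift(desc, data):
    """Hypothetical quality: |subgroup mean - global mean| of the target."""
    target = [row["severity"] for row in data]
    cover = [row["severity"] for row in data
             if all(row[a] == v for a, v in desc)]
    if not cover:
        return 0.0
    return abs(sum(cover) / len(cover) - sum(target) / len(target))


data = [
    {"region": "A", "variety": "x", "severity": 10.0},
    {"region": "A", "variety": "y", "severity": 12.0},
    {"region": "B", "variety": "x", "severity": 50.0},
    {"region": "B", "variety": "y", "severity": 90.0},
]
top = beam_search(data, make_refine(data, ["region", "variety"]),
                  mean_shift, w=2, d=2, q=1)
print(top)
```

A larger w makes the search more exhaustive at the price of more quality evaluations, which is the same trade-off between completeness and computation time discussed for EMM.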

Relation with our problem

EMM is a very general framework to find subgroups exhibiting exceptional behaviors in data. In our setting, we want rules that cover most of the data, so we are not only interested in exceptional behaviors. However, we will need to take exceptional behavior into account: high levels of disease can be exceptional, but they are of interest to agronomists. Moreover, EMM proposes an interesting way to explore the search space, with the use of a description language and the associated refinement operators.

3 Agronomic context : wheat and vine diseases

Our work takes place in the context of crop protection for wheat and vine production. These two kinds of crops suffer from different diseases and with different intensities, but both are of great importance.

Vine is a very important crop in France. Indeed, France, Italy and Spain are the countries that produce and export the most wine in the world. In France, vine crops cover a surface of 0.8 million hectares. There is therefore a great interest in maintaining a large and constant production of wine in France. For this purpose, vine crops must be protected against anything that could alter yields and quality. Unfortunately, vine is one of the most disease-prone crops. It can be attacked by several different diseases, such as mildew or black rot. The most serious of them is mildew.

The presence of mildew can reduce the number of grapes usable for wine production, and if some affected grapes are used to make wine, this can affect its quality. To fight vine diseases, farmers often need to use dozens of phytosanitary products on vine. The Treatment Frequency Index (IFT), which counts the number of reference doses of phytosanitary products applied per hectare of crops during a cropping season, can be up to 21.4. By comparison, this number is much lower on wheat, with an IFT of 4.9. More than 50 % of the phytosanitary products used on vine are used to control mildew.

Wheat is the most important crop in terms of surface in France, with around 5 million hectares of wheat fields. Even if the number of products used on wheat is much lower than on vine, the size of the cultivated area makes it of major importance. So there are, as for vine, incentives to reduce the amount of phytosanitary products used.

The use of these products raises questions about the risks for public health, as well as for the environment. Because of these concerns, the French ministry of agriculture created several tools with the goal of reducing the use of phytosanitary products.

The Bulletin de Santé du Végétal (BSV) is one of these tools. The goal of the BSV is to provide a free, public, weekly risk analysis for all major crops.

This includes observations of symptoms, and of the presence and development stage of diseases. The BSV data are collected through an observation network of scattered parcels. All the data available in the BSV are collected through manual observations by the IFV (Institut Français de la Vigne et du Vin), Arvalis (Institut du Végétal) and Acta (Les Instituts Techniques Agricoles). Experts need to make measurements on vine crops at regular intervals, with a target of one observation per week during the growth season (April to August for vine). There are two types of observed vine parcels. The first is the Témoin non traité (TNT), in English untreated control. TNT parcels are vine parcels in which no phytosanitary products are used. They serve as a comparison and are used to gather information on disease development. An illustration of TNT parcels can be seen in Figure 3b. The other kind of observed parcel is the Parcelle Référence, in English reference parcel, which is treated like any other vine parcel. In the case of vine, all the observations made on both kinds of parcels are stored in a database called Epicure, managed by the IFV. We were granted access to this database for this internship.

Figure 2: Mildew on vine. (a) Leaf affected by mildew; (b) grape affected by mildew.

Because of this complexity, the BSV vine data are expensive to collect. It is therefore important to make the best use of these data in order to make their acquisition cost worthwhile. Different methods have already been used with these data, including statistical analysis and modeling [1]. In the next section, we will present the data mining method we developed to make use of these data.

4 Mining Hybrid Rules

In this section we will present the method we developed during this master internship. Accuracy and interpretability often pull in opposite directions, and very few machine learning methods are able to provide both. The machine learning methods achieving the best accuracy are usually far from interpretable. This includes multi-layer neural networks, which are capable of impressive accuracy but whose predictions are hard to interpret. More generally, the more accurate a model is, the more difficult it tends to be to interpret.

Our method has been developed with interpretability as its primary goal, but also to achieve good accuracy. Obviously, we will not try to compete with neural networks on accuracy, but rather to improve the accuracy achievable with interpretable models.

Figure 3: Observation network and vine crops. (a) Observation network in the Aquitaine region; (b) untreated observed vine crops.

4.1 Existing Interpretable Models and their Problems

The primary focus of this work is the use of agricultural data to describe and predict crop diseases. Until now, these data have mostly been studied by agronomists. The primary goal is to use the data to help farmers apply an appropriate amount of phytosanitary products on the crops, and also to understand the dynamics of the diseases in order to prevent them. To fulfill these goals, agronomists need interpretable methods capable of providing a description of the data and models, as well as making predictions. In order to be as helpful as possible, interpretable models should be able to provide an explanation for all their predictions. In the case of mildew, current analyses mostly involve statistical models that use a limited number of weather variables to predict the disease severity. The downside of these methods is that they focus on a reduced set of predictive variables, chosen by hand by agronomists on the basis of their knowledge of the disease as well as their intuition. So there is a great interest in more automatic methods, capable of using a larger set of predictive variables to predict the disease severity.

Predicting the disease severity can be presented as a regression task (disease severity is a numerical value representing the percentage of crops in a field that are affected by a particular disease). Fortunately, regression is a well-known task in machine learning, and many algorithms can be used to make predictions. However, because we want to describe the data as well as make predictions, agronomists will not be interested in "black box" algorithms that cannot provide explanations for their predictions.

The most obvious regression method meeting these criteria is the regression tree [13]. Studying the regression tree obtained after the learning phase can provide some insight into the data and the important predictive variables. Following the path from a leaf to the root of the tree provides a kind of explanation for the leaf's prediction, and can be presented as a rule. Sadly, regression trees can quickly become extremely large, with thousands of leaves, which makes them hard to interpret. They also have a tendency to overfit the data if their depth is not constrained. Overall, regression trees offer a trade-off between accuracy (large, but hard to interpret trees) and interpretability (small, but often inaccurate trees). There are other kinds of trees usable for regression tasks, such as Regression Model Trees [14], but even if they tend to outperform standard regression trees, they are also much harder to interpret.

Another method, already presented in the bibliographic part of this report, is Contrast Pattern Aided Regression (CPXR) [11]. To our knowledge, CPXR is the first work making use of pattern mining methods along with regression models. This method is able to provide interpretable models thanks to its use of patterns, and shows impressive accuracy results. Let us quickly recall the steps of the CPXR method. CPXR identifies the difficult part of the input data, i.e. the part where a simple linear model makes a large error w.r.t. an error tolerance parameter. The data is split into two classes, Small Error and Large Error, based on this error tolerance and the error of the reference linear model. Then CPXR tries to correct the error by using discriminative patterns to characterize the difficult part (the Large Error class). For each pattern, a local regression model is fitted. CPXR does not try to provide any information on the easiest part of the data, so that part is only represented by a single regression model. When we are interested in descriptive models, we may want to describe all the parts of the data and not rely on a default model for a large part of it.

This kind of approach, mixing patterns and regression models, seems really promising for providing interpretable models for regression problems. Unfortunately, to our knowledge no CPXR implementation is publicly available, which forces users to write their own.

We propose a new method, inspired by the principles of CPXR, that mixes pattern mining and regression models to provide a description of the data as well as a way to make predictions. While inspired by CPXR, our method uses a different and more exhaustive search strategy.

4.2 Method Overview

Our primary goal is to provide interpretable models. We believe that rules are among the easiest models to interpret. A rule is composed of two main parts:

The tail is a condition describing a part of the data. This condition can be a conjunction of simple conditions. In our rule language, a condition can be on symbolic variables or on numeric variables. Conditions on symbolic variables are of the form (Variable = Value). Conditions on numerical variables can be of the following forms: (Variable ≥ Value), (Variable < Value), or an interval (Value1 ≤ Variable < Value2).

The head is a regression model describing the behavior of the target variable on the data described by the tail of the rule. While we could use any regression model in the head, in order to be as interpretable as possible, we restrict our approach to simple linear and quadratic regression models.
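To make the structure of a rule concrete, the following sketch represents a rule as a list of conditions (the tail) and a linear model (the head). It is only an illustration under our own assumptions: the class names `Condition` and `HybridRule`, the variables and all the numeric values are made up, and they do not come from the actual implementation or data.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Condition:
    variable: str                   # variable the condition applies to
    test: Callable[[object], bool]  # symbolic equality or numeric interval test
    text: str                       # human-readable form of the condition


@dataclass
class HybridRule:
    tail: List[Condition]           # conjunction of simple conditions
    intercept: float                # head: linear model of the target variable
    coefficients: Dict[str, float]

    def covers(self, row):
        """True when every condition of the tail holds on the row."""
        return all(c.test(row[c.variable]) for c in self.tail)

    def predict(self, row):
        """Apply the linear head to a row covered by the tail."""
        return self.intercept + sum(
            coef * row[var] for var, coef in self.coefficients.items())


# Hypothetical rule: region = Aquitaine and 15 <= temperature < 25
#                    => severity = 2.0 + 0.5 * humidity
rule = HybridRule(
    tail=[
        Condition("region", lambda v: v == "Aquitaine", "region = Aquitaine"),
        Condition("temperature", lambda v: 15.0 <= v < 25.0,
                  "15 <= temperature < 25"),
    ],
    intercept=2.0,
    coefficients={"humidity": 0.5},
)

row = {"region": "Aquitaine", "temperature": 20.0, "humidity": 80.0}
if rule.covers(row):
    print(" and ".join(c.text for c in rule.tail), "=>", rule.predict(row))
```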

4.3 Extracting Rules Conditions

The first part of our rules is a condition describing a subset of the input data. The quality of the rule condition, i.e. its ability to describe a coherent part of the data, determines the quality of the rule. Indeed, once the algorithm finds a condition, it fits a regression model on the corresponding part of the data. If the condition does not describe a coherent subset of the data, the corresponding regression model will not provide a meaningful representation of the behavior of the target variable, and will provide inaccurate predictions. In this case, the condition can be further refined to improve its quality.

In order to find the rule conditions, we chose to use pattern mining methods. Pattern mining methods are designed to efficiently explore large spaces of symbolic values in order to find regularities in the data. In our case we want to find frequent combinations of variable values in the data, so we use methods inspired by the state-of-the-art frequent closed itemset mining algorithm, LCM [15]. Frequency is an important criterion for our rule conditions: if a set of variable values is frequent in the data, it may be indicative of a correlation between these values and might describe an interesting subset of the data. Using closed frequent itemsets (presented in section 2.1.2) allows us to eliminate redundancy and reduce the search space.

We use a search strategy inspired by the LCM algorithm. LCM explores a lattice of frequent closed itemsets in a depth-first manner. An arbitrary order is set on the items so that the exploration can avoid duplicates (itemset [A,B] is equivalent to itemset [B,A]). With this order, the set of closed frequent itemsets is organized as a tree.

Figure 4: Itemsets tree

From a node of the tree, a transition function generates all the children of this node. Figure 4 illustrates the concept of itemset tree. In a closed itemset mining setup, a closed frequent itemset $P$ is extended with an item $e$, and the closure is computed by taking the intersection of the items present in the transactions of $t(P \cup \{e\})$. In our case, instead of itemsets, we explore a space of rules, but the LCM principles stay relevant. Rule conditions are extended in a similar fashion, by adding a new condition and computing the closure. The main difference with itemsets is that rules have a regression model. Having a regression model in a rule gives us more information: instead of just the support, we now have the regression error to assess the quality of a rule. This additional information can be used to guide, and improve, the search strategy.

Just like CPXR, we can use the error of a rule to create two classes, Small Error (SE) and Large Error (LE). We tested two criteria to create these splits: splitting w.r.t. the median regression error, which divides the data into two equally sized parts, or splitting w.r.t. the mean error. Then, instead of exploring all the possible children of the rule, we focus on children representative of the LE class, meaning that rule refinements are now focused on correcting their parent's error by covering the parts where their parent is inaccurate.

This search strategy has two main advantages: first, it reduces the computation time because we do not systematically explore all the children of a rule; second, it can improve the accuracy of our handling of numerical values. We will explain the handling of numerical values in more detail in the next section. In our experiments on the vine diseases dataset, splitting w.r.t. the mean error yields slightly better results.

To summarize, the error of a rule can be used to guide the search strategy. From a rule, we can choose to explore only the children representative of the LE class, or to explore all the children.
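The two splitting criteria can be sketched as follows. This is a minimal illustration under our own assumptions: the function name `split_by_error` is hypothetical, the errors are taken to be the absolute errors of the rule's regression model on the rows it covers, and sending ties to the SE class is an arbitrary choice.

```python
from statistics import mean, median


def split_by_error(errors, criterion="mean"):
    """Return (SE indices, LE indices) for the rows covered by a rule.

    errors: absolute error of the rule's regression model on each row.
    criterion: "mean" or "median" threshold on the error.
    """
    if criterion == "mean":
        threshold = mean(errors)
    elif criterion == "median":
        threshold = median(errors)
    else:
        raise ValueError("criterion must be 'mean' or 'median'")
    small = [i for i, e in enumerate(errors) if e <= threshold]  # SE class
    large = [i for i, e in enumerate(errors) if e > threshold]   # LE class
    return small, large


# Made-up absolute errors for six rows covered by a rule.
errors = [0.5, 1.0, 0.2, 8.0, 0.3, 6.0]
print(split_by_error(errors, "mean"))
print(split_by_error(errors, "median"))
```

With the mean criterion, a few very large errors pull the threshold up and the LE class becomes small and focused; the median criterion always yields two classes of (almost) equal size.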

Because we are also interested in providing a good description of the data, we decided to leave the choice of search strategy to the user, who can thus choose between completeness and computation time.

4.4 Dealing with Numeric Values

Pattern mining methods cannot naturally deal with numerical variables, as they are designed to work with discrete values. Numerical values being continuous, we must use a discretization process in order to integrate them in our conditions. Because there are infinitely many ways to discretize continuous values, we need to choose a discretization method.

Often, continuous values are discretized as a pre-processing step before being handled by a symbolic method. For example, CPXR discretizes all the numerical values before using a pattern mining algorithm. In our case, we decided to discretize during the exploration. We made this choice in order to have more precise binning and to avoid relying on a single, possibly inaccurate, pre-processing discretization. When we generate the children of a rule, we discretize the numeric variables on the corresponding subset of the data. If the parent rule already contains conditions on some numerical variables, we do not discretize these variables again; all the other numeric variables are discretized. We can then use the obtained bins to generate children. The discretization process is repeated each time we generate children for a rule.

We use two different discretization methods. The first one is a very simple method that splits a numeric variable into two parts w.r.t. the median value of the variable. This method is often inaccurate, but it allowed us to quickly test the integration of numerical variables in our algorithm. It is also cheap in terms of computation time.
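A minimal sketch of the median split is given below. The function names and the temperature values are made up for the example; the two bins correspond to the conditions (Variable < cut) and (Variable ≥ cut).

```python
from statistics import median


def median_bins(values):
    """Split a numeric variable into two bins around its median."""
    cut = median(values)
    return [(float("-inf"), cut), (cut, float("inf"))]


def bin_index(value, bins):
    """Index of the bin [low, high) that contains the value."""
    for i, (low, high) in enumerate(bins):
        if low <= value < high:
            return i
    raise ValueError("value does not fall in any bin")


# Made-up temperatures observed on the rows covered by a rule.
temperatures = [12.0, 18.5, 21.0, 15.5, 25.0, 19.0]
bins = median_bins(temperatures)
print(bins)
print([bin_index(t, bins) for t in temperatures])
```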

For the second discretization method, we use MDLP [16]. MDLP is a multi-interval discretization method originally developed for classification problems. MDLP attempts to discretize numerical values w.r.t. the classes by finding bins that minimize the entropy. The obtained bins should be as representative as possible of one of the classes and not of the others, meaning that the numerical values represented by one bin should correspond to a minimum number of different classes. In the ideal case, each bin corresponds to a unique class. We use this discretization technique with the SE and LE classes of a rule, so that the bins are coherent with our exploration strategy. Recall that we use the LE and SE classes to guide our exploration: the exploration can be focused on rules representative of the LE class of their parent. Even if we use the MDLP discretization with the LE and SE classes, we can still choose to explore all the children of a rule; it is not mandatory to explore in a discriminative manner.
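The principle of entropy-based discretization with the SE and LE classes can be sketched as follows. This is a simplified illustration and not the code used in our experiments: the function names are hypothetical, and the acceptance test is the minimum description length criterion as it is usually stated for this kind of method, which should be read as an assumption about [16] rather than a transcription of it.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Class entropy (in bits) of a non-empty list of labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def best_cut(pairs):
    """pairs: (value, label) sorted by value.

    Returns (weighted entropy, split index) of the best binary cut,
    or None when no cut is possible.
    """
    best, n = None, len(pairs)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        e = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or e < best[0]:
            best = (e, i)
    return best


def mdlp_cut_points(values, labels):
    """Recursively cut while the MDL-style acceptance test holds."""
    pairs = sorted(zip(values, labels))
    cuts = []

    def recurse(part):
        found = best_cut(part)
        if found is None:
            return
        split_entropy, i = found
        left, right = part[:i], part[i:]
        whole_l = [label for _, label in part]
        left_l = [label for _, label in left]
        right_l = [label for _, label in right]
        n = len(part)
        gain = entropy(whole_l) - split_entropy
        k, k1, k2 = len(set(whole_l)), len(set(left_l)), len(set(right_l))
        delta = log2(3 ** k - 2) - (k * entropy(whole_l)
                                    - k1 * entropy(left_l)
                                    - k2 * entropy(right_l))
        if gain > (log2(n - 1) + delta) / n:  # accept the cut
            cuts.append((left[-1][0] + right[0][0]) / 2)
            recurse(left)
            recurse(right)

    recurse(pairs)
    return sorted(cuts)


# Made-up example: low values belong to SE, high values to LE.
values = list(range(1, 13))
labels = ["SE"] * 6 + ["LE"] * 6
print(mdlp_cut_points(values, labels))
```

On this toy input a single cut separates the two classes perfectly, so each bin corresponds to a unique class, which is the ideal case described above.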



4.5 Regression models for the rules

4.5.1 Regression model choice

In the previous section, we described the process behind the generation of a rule's condition. Once a condition is found, we need to compute a predictive model for it. In our case, the goal is to provide numerical predictions, so a regression model is attached to the condition. The regression model is built on the subset of the data corresponding to the condition. Technically, any regression method could be used, but because our goal is to provide interpretable models, we focus on interpretable regression methods. One of the most interpretable regression methods is linear regression, a well-known method that attempts to fit a linear function to the data. Linear regression models are often inaccurate, but since they are used as local regression models, this inaccuracy is acceptable. Overall, on most data, the predictive power of the whole set of rules will be much better than the predictive power of a single linear regression model.

For the sake of accuracy, we also decided to add quadratic regression as a possible regression model for the rules. We can construct quadratic regression models by adding quadratic numerical variables to our data. For example, let $x_1$ and $x_2$ be numerical predictive variables, and $y$ the target variable. In a standard linear regression model, the obtained model will be of the form $y = \alpha + \beta_1 x_1 + \beta_2 x_2$. To create a quadratic regression model with $x_1$ and $x_2$, the following variables are created: $x_1^2$, $x_2^2$ and $x_1 x_2$. We then obtain a model of the form

$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2.$$

This kind of regression model is harder to interpret, but has the ability to model non-linear functions. For understanding vine diseases, we have been told by experts that interaction variables are really difficult to interpret. However, we decided to keep this kind of model for cases where accuracy is more important than interpretability. The user can choose whether the algorithm is allowed to build quadratic models.
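The construction of the quadratic variables can be sketched as below. The function name `quadratic_features` and the naming scheme of the new variables are our own illustrative choices; a linear regression fitted on the extended row is then a quadratic model of the original variables.

```python
from itertools import combinations


def quadratic_features(row):
    """Extend a row {variable: value} with squares and pairwise products."""
    extended = dict(row)
    for name, value in row.items():
        extended[f"{name}^2"] = value * value    # squared terms
    for (a, va), (b, vb) in combinations(row.items(), 2):
        extended[f"{a}*{b}"] = va * vb           # interaction terms
    return extended


print(quadratic_features({"x1": 2.0, "x2": 3.0}))
```

With p numerical variables this creates p squares and p(p-1)/2 interaction terms, which is one reason why such models quickly become harder to read than their linear counterparts.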

4.5.2 The Overfitting Problem

As we increase the complexity of the regression models, we also increase the risk of overfitting. To limit this risk when we use quadratic regression models, we also fit a linear regression model, so that two regression models are fitted for each rule. We choose which of the two to use by evaluating the stability of their predictions. To that end, we test the two regression methods with cross-validation on the subset of data corresponding to the rule. After the cross-validation process, we have the error score of each method on each of the cross-validation splits. The regression method with the most stable error across the splits is the less likely to overfit the data. To measure the stability of the error on the cross-validation splits, we use the index of dispersion:

$$D = \frac{\sigma^2}{\mu}$$

where $\sigma$ is the standard deviation and $\mu$ the mean of the errors. The lower the index of dispersion, the less dispersed the values. For each rule, we choose the regression method with the lowest index of dispersion.
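As an illustration, the selection by index of dispersion can be written in a few lines. The error values below are invented, the function names are hypothetical, and the use of the population variance (rather than the sample variance) is an assumption of this sketch.

```python
from statistics import mean, pvariance


def index_of_dispersion(errors):
    """D = variance / mean of the cross-validation errors."""
    return pvariance(errors) / mean(errors)


def most_stable(cv_errors):
    """cv_errors: {model name: errors on each cross-validation split}."""
    return min(cv_errors, key=lambda name: index_of_dispersion(cv_errors[name]))


# Made-up errors on five cross-validation splits.
cv_errors = {
    "linear": [4.0, 5.0, 6.0, 5.0, 5.0],
    "quadratic": [1.0, 9.0, 2.0, 8.0, 5.0],
}
print({name: index_of_dispersion(e) for name, e in cv_errors.items()})
print(most_stable(cv_errors))
```

In this toy case both models have the same mean error, but the error of the quadratic model varies much more from one split to another, so the linear model is retained.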

There are other ways to evaluate the overfitting of a regression method: we can compare the error produced by a model trained on the whole data with the mean of the errors obtained on the cross-validation splits. The lower the difference between these errors, the less the model overfits. So for each rule, we compare the error difference of each method and choose the one with the lowest. Overall, there is no sure way to systematically determine whether a regression model is overfitting the data, but these methods provide some information that can be used to make a choice.

4.5.3 Sparse Regression Models

Standard linear regression uses all the available predictive variables to build the linear model. Often, especially with high-dimensional data, some variables are of no use for making predictions, and will have a very low importance in the linear model. This leads to an unnecessary complexity that makes the linear model harder to interpret. Because our goal is to provide interpretable rules, we want to minimize the complexity of the regression model as much as possible.

This problem can be formulated as finding a sparse solution to an underdetermined linear system. A sparse representation of a linear system is a linear representation with few non-zero coefficients. This means that variables that do not contribute much are assigned a coefficient equal to zero (instead of a low-valued coefficient in standard linear regression). Variables with a coefficient equal to zero can then be ignored in our rule representation, so we obtain simpler regression models without having to compromise much on accuracy. In the worst case, the sparse representation is equal to the dense representation. In our observations on the vine diseases dataset, most of the time sparse linear regression uses fewer non-zero coefficients than standard linear regression.

The problem of finding a sparse representation of an underdetermined linear system is a known NP-hard problem [17] and has been tackled by several algorithms. Two of the most popular algorithms to solve this problem are Orthogonal Matching Pursuit (OMP) [18] [19] and Least Angle Regression (LARS) [20]. Both of these algorithms find an approximate solution in a greedy manner, but they optimize different problems.

OMP is a greedy algorithm that starts from an all-zero solution and improves it step by step. At each step, OMP adds to the solution the variable that is the most correlated with the current residual vector $r = y - X\gamma$, with $\gamma$ the current solution vector. OMP can be used with two different optimization tasks:

$$\arg\min_{\gamma} \|y - X\gamma\|_2^2 \quad \text{subject to} \quad \|\gamma\|_0 \le N \qquad (1)$$

where $N$ is the maximum number of non-zero coefficients. The second possible formulation is:

$$\arg\min_{\gamma} \|\gamma\|_0 \quad \text{subject to} \quad \|y - X\gamma\|_2^2 \le \varepsilon \qquad (2)$$

where $\varepsilon$ is the error tolerance.

The other algorithm that can be used for sparse linear regression is LARS. LARS can be used to solve the following problem (known as LASSO):

$$\arg\min_{\gamma} \frac{1}{2}\|y - X\gamma\|_2^2 + \alpha\|\gamma\|_1 \qquad (3)$$

where $\alpha$ is a regularization parameter that handles the trade-off between the mean squared error and the $\ell_1$-norm of the solution vector $\gamma$. If $\alpha$ is set to 0, LARS is equivalent to an ordinary least squares regression.

We decided to use OMP because formulation (2) is very convenient in our case: we can dynamically set the desired error tolerance during the mining process. The idea is that the more specific a rule is, the more accurate it should be. A specific but inaccurate rule is not very interesting to the user because it describes a specific part of the data with a low accuracy. When we refine a rule by specialization, we already know its error. All specialized rules should have a better accuracy than the rule they are derived from. So each rule takes as target error tolerance $\varepsilon$ the norm of its parent's residual vector, which forces the rule to at least match the error of its direct parent. In some cases, it is impossible for a rule to match its parent's accuracy, because its condition does not match a coherent part of the data. These rules are not interesting for us, so we discard them from the output. However, we still keep them for further refinements, since adding conditions to the rule may end up matching a more coherent part of the data.
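A minimal NumPy sketch of OMP under the error-tolerance formulation could look as follows. It is a didactic version, not the scikit-learn implementation used in this work: the function name `omp`, the absence of an intercept and of column normalization, and the use of the parent's squared residual norm as tolerance for a specialized rule are all assumptions.

```python
import numpy as np


def omp(X, y, tol):
    """Greedy OMP: add variables until ||y - X @ coef||_2^2 <= tol."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    coef = np.zeros(n_features)
    support = []                      # indices of the selected variables
    residual = y.copy()
    while residual @ residual > tol and len(support) < n_features:
        correlations = np.abs(X.T @ residual)
        if support:
            correlations[support] = -np.inf   # never pick a variable twice
        support.append(int(np.argmax(correlations)))
        # Re-fit all selected variables by least squares (the "orthogonal" step).
        solution, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        coef[:] = 0.0
        coef[support] = solution
        residual = y - X @ coef
    return coef, float(residual @ residual)


# Hypothetical usage: a parent rule, then a specialization fitted on a subset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=60)
parent_coef, parent_sq_error = omp(X_demo, y_demo, tol=1.0)
mask = X_demo[:, 1] > 0               # rows matching the extra condition
child_coef, child_sq_error = omp(X_demo[mask], y_demo[mask], tol=parent_sq_error)
```

If the loop exhausts all variables before reaching the tolerance, the returned squared error stays above `tol`, which corresponds to the case of a rule that cannot match its parent.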

It may be possible to replicate the same kind of process with the LARS method. The idea would be to tune the regularization parameter $\alpha$ dynamically during the mining process to control the trade-off between sparsity and accuracy. One option would be to use the LASSO path to find an appropriate value for $\alpha$: the LASSO path gives the number of non-zero coefficients for each tested value of $\alpha$. We will experiment with this possibility in future work.

4.6 Rule Selection

At the end of the mining process we have a set of rules that can potentially be very large. For example, on the vine disease dataset, with a minimum support of 2%, we obtain 1154 rules. This large set contains many overlapping rules and rules with poor accuracy, alongside very accurate and interesting rules. There are far too many rules for users to study manually, and many of them are useful neither to the user nor for making predictions. Consequently, we designed a post-processing step to select a good subset of rules among all the rules found. The goal is to choose a good subset of rules automatically, with enough flexibility to let the user choose between the interpretability and the accuracy of the output. We modeled the rule selection as a Mixed Integer Programming (MIP) problem. We used the ortools constraint programming solver, which also provides an interface to a MIP solver called Coin-or Branch and Cut (CBC).

We tested several problem formulations. In all formulations, the goal is to minimize the overall error of the selected set of rules under several constraints. The first common constraint is on the number of rules: the user provides a number K of desired rules. This number allows the user to choose the trade-off between accuracy (large number of rules) and interpretability (small number of rules).

Strict formulation

The objective function is:

$$\min_{X_R} \sum e(R) \times X_R$$

with $e(R)$ the rules error vector. Each element of this vector corresponds to the error of one rule. $X_R$ is a vector of 0s and 1s of the same size as $e(R)$. The MIP solver finds the optimal vector $X_R$, i.e. the one that minimizes the sum of the vector $e(R) \times X_R$. This problem must be solved with the following constraints: for each instance $p$ of the data,

$$\sum_{r \in \Omega(p)} X_R(r) \ge 1$$

$$\sum_{r \in \Omega(p)} X_R(r) \le 4$$

where $\Omega(p)$ is the vector of rules covering the data instance $p$. The first constraint forces the selected set of rules to cover all the instances of the data. The objective is to have a complete description of the dataset. The second constraint limits the overlap within the selected set of rules. We arbitrarily decided to allow at most four rules to cover one data instance.

Relaxed formulation

The objective function of the relaxed version is:

$$\min_{X_R} \sum (e(R) - w \times s(R)) \times X_R$$

The main difference with the strict formulation is the addition of $s(R)$, the support of the rules $R$. In this case, both the error and the support are scaled between 0 and 1. $w$ is a coefficient used to weight the importance of the support in the objective. In our experiments, without a weight, the solver selected very general rules (rules with high support) with a poor accuracy. We currently use $w = 0.5$ in our implementation to make the accuracy more important than the support.

In this relaxed version we also change the constraint on the minimum number of rules needed to cover a data instance. We now allow instances not to be covered by the selected set of rules:

$$\sum_{r \in \Omega(p)} X_R(r) \ge 0$$

Both the strict and the relaxed formulations also have this common constraint:

$$\sum_{r \in R} X_R(r) \ge K$$

This constraint forces the solver to choose at least K rules in the solution. We do not set an upper limit on the number of rules, in order to cover cases where an optimal solution may have more than K rules, as well as to leave some freedom to the solver for difficult cases. K is a user parameter.

In most cases, the strict formulation cannot be used in practice, as it is often impossible to satisfy all the constraints. The biggest problem is the coverage constraint: we cannot be sure that our set of rules covers the whole dataset. When it does not, it is impossible to select a subset of rules satisfying this constraint.
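To make the two formulations concrete, the sketch below enumerates every 0/1 selection vector on a toy rule set and keeps the cheapest feasible one. This exhaustive search only stands in for the MIP solver and is usable on a handful of rules at most; the function name `select_rules`, its parameters and the toy data are assumptions. With `w=0` and `min_cover=1` it behaves like the strict formulation, with `w=0.5` and `min_cover=0` like the relaxed one.

```python
from itertools import combinations


def select_rules(errors, supports, coverage, k, w=0.5, min_cover=0, max_cover=4):
    """Brute-force version of the rule selection problem.

    errors, supports: per-rule values, assumed scaled to [0, 1].
    coverage: for each data instance, the set of indices of the rules covering it.
    k: minimum number of rules to select.
    """
    n_rules = len(errors)
    best_subset, best_cost = None, float("inf")
    for size in range(k, n_rules + 1):            # at least K rules
        for subset in combinations(range(n_rules), size):
            chosen = set(subset)
            counts = [len(chosen & rules_of_p) for rules_of_p in coverage]
            if any(c < min_cover or c > max_cover for c in counts):
                continue                          # coverage or overlap violated
            cost = sum(errors[r] - w * supports[r] for r in subset)
            if cost < best_cost:
                best_subset, best_cost = subset, cost
    return best_subset, best_cost


# Toy data: rule 0 is very general but inaccurate.
toy_errors = [0.9, 0.2, 0.3, 0.1]
toy_supports = [1.0, 0.5, 0.4, 0.1]
toy_coverage = [{0, 1}, {0, 2}, {0, 1, 3}, {0, 2}]

strict = select_rules(toy_errors, toy_supports, toy_coverage, k=1, w=0.0, min_cover=1)
relaxed = select_rules(toy_errors, toy_supports, toy_coverage, k=2, w=0.5, min_cover=0)
```

When no subset satisfies the constraints, the function returns `None` as subset, which mirrors the infeasible case described above for the strict formulation.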



4.7 Algorithm Description

input : D: Dataset, K: Minimum number of rules, minsup: Minimum support threshold
output: S: set of rules

 1  stack ← frequent symbolic values w.r.t. minsup
 2  Stmp ← ∅  // list of rules
 3  while stack ≠ ∅ do
 4      r = new empty rule
 5      r.condition ← stack.pop
 6      localdata ← instances of D matching r.condition
 7      r.regression ← fitRegression(localdata)
 8      Stmp.add(r)  // add r to the solution
 9      discretize(localdata)  // discretization of numeric variables
10      ext ← frequent values of localdata w.r.t. minsup
11      foreach cond in ext do
12          newcond ← r.condition ∪ cond
13          if newcond is frequent then
14              newlocaldata ← instances of localdata matching newcond
15              newcond ← closure(newlocaldata)
16              stack.push(newcond)
17          end
18      end
19  end
20  S ← rulesSelection(Stmp)  // select a minimum of K rules

Algorithm 1: Hybrid rules algorithm overview

Algorithm 1 provides a high-level view of the main processing steps of our Hybrid Rules algorithm. The algorithm explores the space of rules in a depth-first manner.

Lines 9 to 16 correspond to the rule refinement step. Line 9 is the discretization step that creates all possible numerical conditions to extend the current rule. The variable ext, line 10, contains all the frequent conditions usable for the rule refinement. By frequent conditions, we mean conditions matching a subset of the data larger than the minimum support threshold. Each possible condition is added to the current condition in order to create a new condition. In practice, we follow an arbitrary order on the dataset's variables while constructing new conditions to avoid generating duplicates: a condition can be extended with a new condition only if the variable in the new condition does not precede the variables already in the condition.

Lines 4 to 8 correspond to the addition of a regression model to a rule. Line 7 is the step where the regression is fit on the data corresponding to the rule. This includes the choice between linear and quadratic regression discussed in section 4.5.2, when quadratic regressions are allowed.

Line 20 corresponds to the rule selection process discussed in section 4.6. The selection is done using one of the possible formulations. We automatically relax some constraints if the solver cannot find a solution: we can increase the number of rules allowed to cover a data instance, and reduce the minimum number of desired rules, until the solver finds a solution.
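The depth-first exploration and the variable ordering that avoids duplicates can be sketched as follows. This is a simplified illustration restricted to symbolic variables: discretization, closure computation, regression fitting and rule selection are left out, and the function name `mine_conditions` as well as the data layout (a list of dictionaries) are assumptions.

```python
def mine_conditions(rows, variables, minsup):
    """Depth-first enumeration of frequent conjunctions of (variable, value) pairs.

    rows: list of dicts mapping each variable name to a symbolic value.
    variables: ordered list of variable names; a condition is only extended with
               variables that come later in this order, so each conjunction is
               generated once.
    minsup: minimum number of matching rows.
    """
    def matching(condition):
        return [row for row in rows if all(row[var] == val for var, val in condition)]

    def frequent_items(data, start):
        counts = {}
        for row in data:
            for var in variables[start:]:
                counts[(var, row[var])] = counts.get((var, row[var]), 0) + 1
        return sorted(item for item, n in counts.items() if n >= minsup)

    stack = [(item,) for item in frequent_items(rows, 0)]
    found = []
    while stack:
        condition = stack.pop()
        local_data = matching(condition)
        found.append((condition, len(local_data)))   # a regression would be fit here
        next_var = variables.index(condition[-1][0]) + 1
        for item in frequent_items(local_data, next_var):
            stack.append(condition + (item,))
    return found


demo_rows = [
    {"grape": "merlot", "region": "north"},
    {"grape": "merlot", "region": "north"},
    {"grape": "merlot", "region": "south"},
    {"grape": "syrah", "region": "south"},
]
demo_rules = mine_conditions(demo_rows, ["grape", "region"], minsup=2)
```

Because an extension may only use later variables, the conjunction of "grape = merlot" and "region = north" is reached from the grape condition only, never from the region condition.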



5 Experimental evaluation

In this section we report the experimental evaluation of our method on several datasets. We evaluate the performance of the tested methods using two different error metrics. There are many different error metrics usable for regression problems. Most of them involve the expression of a difference between the predicted values $\hat{y}$ and the true values $y$. The first metric we use is the Root Mean Squared Error (RMSE), which is a standard error metric for regression problems:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

However, because of its use of a squared error, RMSE strongly penalizes large errors. We therefore use a second standard error metric, the Median Absolute Error (MAE):

$$\mathrm{MAE} = \mathrm{median}(|y - \hat{y}|)$$

This error metric, thanks to its use of the median, does not penalize outliers as much as the RMSE.

We implemented our method in Python 3 with the scikit-learn machine learning library [21].
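The difference between the two metrics can be illustrated with a few lines of standard-library Python. The function names and the toy values are assumptions chosen for the example; they are not taken from the experiments.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def median_absolute_error(y_true, y_pred):
    """Median Absolute Error (called MAE in this report)."""
    return statistics.median(abs(t - p) for t, p in zip(y_true, y_pred))


# Four perfect predictions and one large miss.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [1.0, 2.0, 3.0, 4.0, 15.0]
print(rmse(y_true, y_pred))                   # sqrt(100 / 5), about 4.47
print(median_absolute_error(y_true, y_pred))  # 0.0
```

The single outlier dominates the RMSE, while the median ignores it entirely.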

5.1 Datasets and competitors

Due to time constraints, we only tested our method on a few datasets. Our primary focus was the vine disease dataset provided by IFV, Acta and Arvalis, since we developed the method with the primary goal of exploiting this kind of agricultural data. To validate the method, we tested it on two other datasets, the CPU dataset (source UCI) and the Abalone dataset (source UCI). All of these datasets contain both numeric and symbolic variables that can be used as predictive variables. Table 1 contains the major characteristics of these datasets.

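| Dataset | Number of attributes | Number of instances | Number of numeric attributes | Number of symbolic attributes |
|---|---|---|---|---|
| Vine diseases | 73 | 32815 | 59 | 14 |
| Vine diseases (reduced) | 28 | 13294 | 14 | 14 |
| CPU | 10 | 209 | 8 | 2 |
| Abalone | 9 | 4177 | 7 | 1 |

Table 1: Datasets characteristics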

Note that we do not use the full vine dataset in our experiments: we limited ourselves to the available symbolic attributes and the meteorological indicators. Furthermore, we only use measurements concerning the TNT (Témoin Non Traité, i.e. untreated control) observations. Agronomists are mostly interested in the TNT parcels because it is easier to understand the disease dynamics on them. Reference parcels are treated with phytosanitary products, and it can be difficult to know which treatment protocols were used by the farmers. The characteristics of this reduced vine dataset are indicated as Vine diseases (reduced) in Table 1.



We decided to compare our approach with several regression algorithms. The first is CPXR, the method that inspired ours and that introduced the use of patterns for regression problems. Since no CPXR implementation is available, we made our own implementation based on the original paper, in Python 3 with the scikit-learn library. The other main competitor is the Regression Tree algorithm. Regression Trees provide relatively interpretable models and are a reference regression method. We used the scikit-learn implementation of the CART algorithm. The other competitors are other regression algorithms: Random Forest Regressor, a state-of-the-art regression method; Gradient Boosting Trees, a boosting algorithm that builds a strong regressor from several weak regression trees; and Support Vector Regressor, an adaptation of Support Vector Machines to regression problems. For our experiments, the reference regressor is a standard Linear Regression. For all these methods, we used the implementation provided by scikit-learn.

5.2 Accuracy analysis

For the accuracy analysis, we focus on the error reduction over a reference Linear Regression. Error reduction measures the error improvement over the base Linear Regression model and is defined as:

$$\text{Err Reduction} = \frac{\mathrm{Err}(LR) - \mathrm{Err}(X)}{\mathrm{Err}(LR)} \times 100$$

where X is the tested regression method and Err an error metric, either RMSE or MAE. To ensure a fair comparison, we tried to set comparable parameters for all the tested methods. Hybrid Rules are tested with a minimal support of 2%, and with K equal to 100 for the vine dataset, 5 for the CPU dataset and 10 for the Abalone dataset. We chose a high K for the vine dataset due to the complexity of this dataset. CPXR is also used with a minimal support of 2%. All the tree-based methods (Regression Trees, Random Forest and Gradient Boosting Trees) are also constrained to leaves covering at least 2% of the tested datasets. This is done to have a similar level of specificity between CPXR, Hybrid Rules and tree-based regression methods. Support Vector Regressor is used with scikit-learn default parameters. We tested two versions of our rules, one limited to linear models (Hybrid Rules (L)), and one allowed to use quadratic models (Hybrid Rules (Q)).
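As a worked instance of the error reduction measure, the short function below applies the definition directly. The function name and the error values are illustrative assumptions, not results from the experiments.

```python
def error_reduction(err_lr, err_x):
    """Error reduction (%) of method X over the reference Linear Regression."""
    return (err_lr - err_x) / err_lr * 100.0


print(error_reduction(8.0, 6.0))    # 25.0: X has a 25% lower error than LR
print(error_reduction(8.0, 10.0))   # -25.0: X is worse than LR
```

A negative value therefore means that the tested method has a higher error than the reference Linear Regression.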

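| Dataset | CPXR | Random Forest | Regression Tree | Gradient Boosting Trees | Support Vector Regressor | Hybrid Rules (L) | Hybrid Rules (Q) |
|---|---|---|---|---|---|---|---|
| Vine diseases | 3.9 | 3.83 | 4.03 | 11.34 | -1.22 | 9.68 | 19.88 |
| CPU | 39.41 | -20.54 | -1.18 | 64.45 | -51.87 | 57.05 | 96.05 |
| Abalone | 1.81 | 3.18 | 1.51 | 9.21 | -3.81 | 4.22 | 6.74 |

Table 2: RMSE reduction (%) over Linear Regression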

Table 2 shows the RMSE reduction achieved by all the tested methods on the tested datasets. The first observation is that our method offers an error reduction comparable to state-of-the-art regression methods. Notably, our method systematically outperforms the CPXR algorithm, often by comfortable margins. The only method that is systematically better than our method on the tested datasets is Gradient Boosting Trees. It is interesting to note the excellent performance of Hybrid Rules on the CPU dataset. CPU is a small dataset representing the CPU performance of different manufacturers. CPU performance seems to be strongly tied to the CPU manufacturer. Hybrid Rules are able to automatically create rules for each manufacturer, which explains the good accuracy. We believe that CPXR misses some of these particularities due to its initial split of the data into Low Error/Large Error classes.

| Dataset | CPXR | Random Forest | Regression Tree | Gradient Boosting Trees | Support Vector Regressor | Hybrid Rules (L) | Hybrid Rules (Q) |
|---|---|---|---|---|---|---|---|
| Vine diseases | 12.9 | 75.07 | 78.04 | 68.68 | -23.01 | 61.75 | 67.72 |
| CPU | 27.92 | 92.04 | 90.67 | 88.95 | 72.48 | 78.42 | 99.30 |
| Abalone | 0 | 8.22 | 8.57 | 14.35 | 12.87 | 8.37 | 10.59 |

Table 3: MAE reduction (%) over Linear Regression

However, we did not manage to reproduce the results of the original CPXR paper with our implementation of the algorithm (the results in the paper are better than ours). As said earlier, no CPXR implementation is available, so we can only rely on our own implementation for the results.

The MAE reduction in Table 3 is less informative, being much more unstable across the methods. Nonetheless, Hybrid Rules significantly improve the MAE over the reference linear regression and achieve an MAE reduction comparable to the other methods. Overall our method shows good results with a limited number of interpretable rules, which is encouraging.

Hybrid Rules with quadratic regression are capable of impressive results, especially on the CPU dataset. Indeed, CPU performance in the CPU dataset follows non-linear laws, so the use of quadratic regression is particularly well suited. There is also a large improvement on the vine disease dataset, which may indicate that mildew intensity does not follow linear laws. However, the obtained rules are much harder to interpret, so we will not use them for our exchanges with agronomists.

5.3 Impact of the number of rules on the accuracy

Figure 5 shows the impact of the number of rules on the prediction accuracy, represented by the RMSE, on the vine data. Our method was tested with fixed parameters, except for K, the number of rules. It is important to note that the vine data is the most difficult of our tested datasets. The first observation is that, globally, the more rules, the better the accuracy. This result is not surprising: increasing complexity should translate into increased accuracy. We can observe that the influence of the number of rules on the vine disease dataset is limited: 20 rules seem to be sufficient to already achieve a good accuracy.

But we can also observe an instability of the curve. Since our algorithm is deterministic (everything involving random processes uses a fixed seed), and because the only varying parameter is K, we can attribute this variability to the rule selection process. We further made sure that the set of rules available to the selection process was the same for all values of K by running the selection process on a saved set of rules. We do not yet fully understand the cause of this instability. We know that some values of K may not be optimal w.r.t. the set of rules to choose from, but because the solver is allowed to use more than K rules in the solution, this should not occur. A deeper understanding of the inner workings of the MIP solver may be needed to fully comprehend this problem.



Figure 5: Accuracy with respect to the number of rules

The number of rules may also impact the overfitting of our method. The more rules there are, and the more specific they are, the higher the risk of overfitting. The upside of using a set of rules for predictions is that overly specific rules are not used to predict new data if they are not relevant for it. The risk of overfitting is much higher on the regression part of the rules than on the conditions. Overall, on the vine dataset, we think that even 100 is a reasonable number of rules, given the complexity of the data and the specificity of the behavior of certain diseases. Some diseases only occur in a very small subset of the observations, so we need many specific rules to capture these behaviors.

5.4 Impact of the discretization on the accuracy

| Dataset | Hybrid Rules (MDLP) | Hybrid Rules (Median) |
|---|---|---|
| Vine diseases | 7.84 | 8.30 |
| CPU | 20.24 | 19.79 |
| Abalone | 2.14 | 2.12 |

Table 4: Accuracy comparison between discretization methods (RMSE)

Table 4 compares the RMSE of our rules depending on the discretization method employed. The accuracy scores are very similar on the three datasets, except for the vine diseases dataset, which clearly benefits from a more accurate discretization process. Analysis of the obtained rules shows that many of them contain numerical conditions, so by using MDLP to discretize the numerical values, we are able to provide more relevant conditions.



On the Abalone dataset, there is almost no difference in accuracy. When we analyze the output set of rules, we see that only one of the 10 selected rules has a numerical condition. Since this rule has a small support, it does not weigh much on the accuracy, so the discretization method does not really matter in this case.

For the CPU dataset, using the median value of a variable as splitting point seems to be slightly more accurate than using MDLP. Analysis of the rules shows that the most important variable in the conditions is the manufacturer name, which is a symbolic attribute (all the rules contain a condition on it). For the other conditions, on numeric values, the median value seems to be a suitable splitting point in this particular case.
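Median discretization can be sketched in a few lines. How the resulting conditions are encoded in the actual implementation is not described here, so the half-open intervals below are an assumption, chosen to mirror numerical conditions of the form low ≤ variable < high; the function names are hypothetical as well.

```python
import statistics


def median_split(values):
    """Return the two half-open intervals [low, high) induced by the median."""
    cut = statistics.median(values)
    return [(float("-inf"), cut), (cut, float("inf"))]


def matches(value, interval):
    """True when low <= value < high."""
    low, high = interval
    return low <= value < high


intervals = median_split([90, 101, 120, 138, 150])
below, above = intervals
print(intervals)            # two candidate conditions on the variable
print(matches(101, below))  # True: 101 lies under the median
```

Each numeric variable thus yields two candidate conditions, which keeps the space of rules much smaller than a finer discretization would.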

5.5 Running times analysis

In this section we study the running time of our method depending on the kind of regression and the discretization method. The results were obtained on a Linux computer running the Fedora 27 distribution with the following hardware specification: Intel Xeon E5-1603 v3 CPU, 8 GB of RAM. It is worth noting that the current implementation is written in Python and that many optimizations could be done to improve running times. As in our other experiments, we run our method with a minimum support of 2%.

| Dataset | Hybrid Rules (L) | Hybrid Rules (Q) |
|---|---|---|
| Vine diseases | 196.5s | 244.5s |
| CPU | 0.4s | 0.84s |
| Abalone | 2.5s | 2.76s |

Table 5: Running times comparison between linear Hybrid Rules and quadratic Hybrid Rules

| Dataset | Hybrid Rules (MDLP) | Hybrid Rules (Median) | Hybrid Rules (None) |
|---|---|---|---|
| Vine disease | 196.59s | 77.25s | 8.22s |
| CPU | 0.41s | 0.89s | 0.16s |
| Abalone | 2.53s | 4.36s | 1.46s |

Table 6: Running times comparison between discretization methods

The running time depends a lot on the kind of discretization used. On the vine disease dataset, the kind of discretization strongly influences the running time: using the median discretization instead of MDLP reduces the running time by around 60%. On the CPU and Abalone datasets, however, the median discretization is slower than MDLP. Using no discretization at all further reduces running times on all the datasets. When no discretization is used, only symbolic variables are considered for rule conditions, which significantly reduces the space of possible rules.

Using quadratic models instead of linear models also increases the running times, but the impact is not as strong as the choice of discretization method.



Figure 6: Noise sensitivity of regression methods

5.6 Noise sensitivity

In order to provide good quality predictions, regression models need to be as insensitive as possible to noise. For example, in the case of agricultural data, a noise-sensitive model would mean that the quality of the learned models is easily altered by errors in the measurements. High sensitivity to noise can also be indicative of overfitting.

To evaluate the noise sensitivity, we measure the difference between the errors made on clean data and the errors made on data with artificial noise. We chose to use the CPU dataset for this test. We added different levels of noise to the data in the following manner: for each noise level $n$, we replace the value of the target variable $y$ with $y + y \times z$, with $z$ a random value in the range $[-n, n]$. The noise levels used were 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35 and 0.40.
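The noise injection step can be sketched with the standard library as follows. The function name, the uniform draw for z and the fixed seed are assumptions; the text only states that z is a random value in the given range.

```python
import random


def add_relative_noise(y, noise_level, seed=0):
    """Replace each target value v by v + v * z, with z drawn in [-noise_level, noise_level]."""
    rng = random.Random(seed)
    return [value + value * rng.uniform(-noise_level, noise_level) for value in y]


targets = [10.0, 200.0, 35.0, 1150.0]
noise_levels = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
noisy_targets = {n: add_relative_noise(targets, n) for n in noise_levels}
```

Since the noise is proportional to the target value, large targets receive larger absolute perturbations than small ones.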

It seems that Gradient Boosting Trees are the most sensitive to noise, followed by Hybrid Rules and finally Regression Trees. The parameters used for the regression tree were relatively conservative, so the obtained tree was relatively general, limiting the overfitting and the noise sensitivity. The noise sensitivity of Hybrid Rules can be attributed to the sensitivity of the linear regression models used in the rules. Indeed, the rule conditions obtained at different noise levels were the same; only the regression parts changed.

6 Result discussion

We presented some of the results obtained on the vine disease dataset to experts to get their opinions about the rules. They found our rules interesting, but the presentation needed an overhaul. They also needed more information about the context of the rules. They were interested in the dates concerned by a rule. Some rules already had time information such as the week or the month of observation, but others did not. In any case, for disease data, having the date information for each rule is important for interpretability.

Geographical information is another important aid to the interpretability of the rules. For vine, some regions are more subject to diseases than others. Knowing which vineyards are covered by a rule would provide interesting information.

As an example, this is one of the rules obtained to predict the intensity of mildew on leaves:

IF (vine kind = "Merlot Noir") ∧ (101 ≤ number of rainy days < 138)
THEN Mildew intensity on leaves = -71 + 0.01 × (Sum relative humidity the previous month) + 0.01 × (Sum of temperature amplitudes)

In this rule, we can see that the regression uses variables such as relative humidity. Relative humidity is known by experts to play a significant role in the development of mildew, so it is not surprising to see it used to predict mildew intensity. Relative humidity is often left out of current statistical models because it is difficult to measure accurately. Nonetheless, high relative humidity is a known favorable factor for mildew. The other variable used in the regression part is the temperature amplitude. A high temperature amplitude would be responsible for night dew, which is also a known aggravating factor in the development of mildew.
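To show how such a hybrid rule produces a prediction, the example rule can be written as a small function: the condition decides whether the rule applies, and the linear model gives the predicted value. This is a sketch only: the dictionary keys are hypothetical names for the dataset's variables, and the intercept is read as -71.

```python
def predict_mildew_intensity(obs):
    """Apply the example rule; return None when its condition does not match."""
    applies = obs["vine_kind"] == "Merlot Noir" and 101 <= obs["rainy_days"] < 138
    if not applies:
        return None
    return (
        -71
        + 0.01 * obs["sum_rel_humidity_prev_month"]
        + 0.01 * obs["sum_temp_amplitudes"]
    )
```

An observation that matches no selected rule needs a fallback prediction, and how that case is handled is outside the scope of this sketch.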

With this feedback, we can see that our method is capable of finding rules involving known mildew factors. To further improve the results, we would need to run the method with a greater accuracy (low minimum support and high number of rules) in order to handpick specific rules describing high mildew cases.

7 Conclusion and future works

In this internship we developed a method capable of extracting hybrid rules containing numeric and symbolic variables. Hybrid rules provide interpretable models usable for regression problems. Our method has been applied to the BSV's vine data and showed interesting results, both in accuracy when compared to other state-of-the-art algorithms and by providing interpretable rules that have been examined by experts.

On other datasets, the results were encouraging, showing that interpretable models can also provide an accuracy comparable to some "black box" regression methods. The results show that combining pattern mining and regression is indeed a viable method for machine learning tasks.

One of the first things we will do in the near future is to use a larger number of test datasets for our experiments. Ideally we would like to compare ourselves on all of the fifty datasets used in the CPXR paper. Unfortunately, most of them are not available online.

Our exchanges with agronomists provided us with many possible improvements to help them understand crop diseases. In the vine disease dataset, most of the observations concern vineyards with a low amount of disease. Indeed, most of the time, vine crops are fortunately unaffected by diseases. So our method tends to find rules for cases with a low amount of disease, as well as cases with no disease at all. Only very few rules concerned high levels of disease, which is what interests agronomists. So in our future work, we will modify our method so that it finds in priority rules with high levels of mildew. There are many different ways to tackle this problem. We could change the rule selection process to select high-disease rules instead of optimizing the overall error. The most efficient way would probably be to tackle this directly in the exploration process: we could direct rule refinements to target parts of the data with high levels of disease. Another solution may be to use some kind of Exceptional Model Mining method. Indeed, high levels of disease are relatively exceptional in the data.

Another request, also related to the disease level, would be to classify the obtained rules into several categories based on their predicted level of disease. For example, we would present ten different sets of rules: one for rules predicting less than ten percent of disease, then rules predicting more than ten percent, rules predicting more than twenty percent, and so on. With this presentation, experts would have rules covering all the possible cases.

There are also specific time periods we could target with our rules because they are particularly important for the development of diseases. In the case of mildew, if we are interested in the disease frequency on vine leaves, we should target the second half of May as well as the full month of June. If we are interested in the mildew intensity on vine grapes, we should instead look at observations made during the end of July and the beginning of August.

Lastly, our validations on the vine disease data were done by K-fold cross validation. Our cross validation process randomly shuffles the data and splits it into K parts. It then uses K-1 parts for training and 1 part for testing, and repeats until every part has been used for testing. It seems that this method is not optimal in the case of disease monitoring data. Instead of using randomly created splits, we should remove an entire year from the data and use the other years as training data. Then, we test the trained models on the data from the removed year. This way we can see whether our models are capable of correctly predicting the diseases of one year from the disease levels of the other years.
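The year-based validation scheme proposed here could be sketched as a simple generator. The function name, the row layout (a list of dictionaries) and the `year` key are assumptions made for illustration.

```python
def leave_one_year_out(rows, year_key="year"):
    """Yield (held_out_year, train_rows, test_rows) for each year in the data."""
    years = sorted({row[year_key] for row in rows})
    for held_out in years:
        train = [row for row in rows if row[year_key] != held_out]
        test = [row for row in rows if row[year_key] == held_out]
        yield held_out, train, test


observations = [
    {"year": 2015, "intensity": 1.0},
    {"year": 2016, "intensity": 4.0},
    {"year": 2016, "intensity": 3.5},
    {"year": 2017, "intensity": 0.5},
]
splits = list(leave_one_year_out(observations))
```

Unlike random K-fold splits, no observation from the tested year is ever seen during training, which is closer to how the models would be used on a new season.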

On the method side, we plan to extend the rule language. More specifically, we would like to include negative conditions. A negative condition would be of the form (Variable ≠ Value). This kind of condition would be used for symbolic variables only. The main issue in moving towards this type of condition is that it will significantly increase the space of possible rules: for each existing symbolic condition, we would add its negation to the possibilities. There may be other ways to further improve the rule language. In any case, we think it may be worthwhile to improve the expressiveness of our rule conditions.


[1] L. Michel, Mieux valoriser les r�eseaux d'�epid�emiosurveillance lors de l'�elaboration du Bul-

letin de Sant�e du V�eg�etal. PhD thesis, Institut des Sciences et Industries du Vivant et del'Environnement (AgroParisTech), 2016.

[2] R. Agrawal, R. Srikant, et al., \Fast algorithms for mining association rules," in Proc. 20th

int. conf. very large data bases, VLDB, vol. 1215, pp. 487{499, 1994.

[3] M. J. Zaki, \Generating non-redundant association rules," in Proceedings of the sixth ACM

SIGKDD international conference on Knowledge discovery and data mining, pp. 34{43, ACM,2000.

[4] L. A. Gal�arraga, C. Te ioudi, K. Hose, and F. Suchanek, \AMIE: association rule mining underincomplete evidence in ontological knowledge bases," in Proceedings of the 22nd international

conference on World Wide Web, ACM, 2013.



[5] J. Wang, J. Han, Y. Lu, and P. Tzvetkov, "TFP: An efficient algorithm for mining top-k frequent closed itemsets," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 5, pp. 652–663, 2005.

[6] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, (New York, NY, USA), pp. 1–12, ACM, 2000.

[7] H. Li, J. Li, L. Wong, M. Feng, and Y.-P. Tan, "Relative risk and odds ratio: A data mining perspective," in Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 368–377, ACM, 2005.

[8] D. Duvenaud, J. Lloyd, R. Grosse, J. Tenenbaum, and Z. Ghahramani, "Structure discovery in nonparametric regression through compositional kernel search," in Proceedings of the 30th International Conference on Machine Learning, pp. 1166–1174, PMLR, 17–19 Jun 2013.

[9] A. Karalič and I. Bratko, "First order regression," Machine Learning, vol. 26, no. 2-3, pp. 147–176, 1997.

[10] N. Fanizzi, C. d'Amato, F. Esposito, and P. Minervini, "Numeric prediction on OWL knowledge bases through terminological regression trees," International Journal of Semantic Computing, vol. 6, no. 04, pp. 429–446, 2012.

[11] G. Dong and V. Taslimitehrani, "Pattern-Aided Regression Modeling and Prediction Model Analysis," IEEE Transactions on Knowledge and Data Engineering, vol. 27, pp. 2452–2465, Sept. 2015.

[12] W. Duivesteijn, A. J. Feelders, and A. Knobbe, "Exceptional Model Mining: Supervised descriptive local pattern mining with complex target concepts," Data Mining and Knowledge Discovery, vol. 30, pp. 47–98, Jan. 2016.

[13] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, "Classification and regression trees," 1984.

[14] D. Malerba, F. Esposito, M. Ceci, and A. Appice, "Top-down induction of model trees with regression and splitting nodes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 5, pp. 612–625, 2004.

[15] T. Uno, T. Asai, Y. Uchida, and H. Arimura, "Lcm: An efficient algorithm for enumerating frequent closed item sets," 2003.

[16] U. Fayyad and K. Irani, "Multi-interval discretization of continuous-valued attributes for classification learning," 1993.

[17] B. K. Natarajan, "Sparse approximate solutions to linear systems," SIAM journal on computing, vol. 24, no. 2, pp. 227–234, 1995.

[18] S. G. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Transactions on signal processing, vol. 41, no. 12, pp. 3397–3415, 1993.



[19] R. Rubinstein, M. Zibulevsky, and M. Elad, "Efficient implementation of the k-svd algorithm using batch orthogonal matching pursuit," 2008.

[20] B. Efron, T. Hastie, I. Johnstone, R. Tibshirani, et al., "Least angle regression," The Annals of statistics, vol. 32, no. 2, pp. 407–499, 2004.

[21] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

This work was supported by the French National Research Agency under the Investments for the Future Program, referred to as ANR-16-CONV-0004.