CLASSIFICATION OF VEGETABLES BASED ON DECISION TREE FOR MULTICLASS PROBLEM

International Journal of Image Processing and Visual Communication

ISSN (Online) 2319-1724 : Volume 1 , Issue 2 , October 2012


Classification of Vegetables based on Decision Tree for Multiclass Problem

Suresha M, Ravikumar M

Department of Computer Science, Kuvempu University

Karnataka, India

Abstract: In this paper, we propose a method for classifying vegetables based on extracted texture properties. Watershed segmentation is used to segment the images. Texture features of the vegetables, namely the red component, green component, skewness, kurtosis, variance, and energy, are extracted. Vegetable images are normalized using an image resize technique with proper scaling, with the aim of eliminating the effects of orientation. Classification is performed using Mean-around features, Gray Level Co-occurrence Matrix (GLCM) features, and combined (Mean around-GLCM) features. A decision tree classifier is used to classify the vegetables into eight classes. The splitting rules used for growing the decision tree are the Gini diversity index (gdi), the twoing rule, and entropy. The results obtained from the proposed method are in good agreement with expert classification. The approach was evaluated on a vegetable data set using cross validation and achieved success rates of up to 95.70%.

Keywords: Decision Tree Classifier, GLCM, Mean-around Features, Texture Features, Vegetables Classification.

I. INTRODUCTION

Recognizing different kinds of vegetables and fruits is a recurrent task in supermarkets and food processing industries. Such tasks often involve complex classification problems. In these scenarios, a single feature set may not capture the separability of the classes, and more features may be needed to improve classification accuracy. Adding features has the drawback of increasing the dimensionality of the data, which may require more training samples, but it can increase accuracy. This paper presents a method for classifying vegetables using Mean-around features, GLCM features, and combined features.

Eight types of vegetables are considered in this work: Cabbage, Beetroot, Capsicum, Carrot, Chillies, Cucumber, Bittermelon, and Onion. In the proposed method, vegetables are classified based on Mean-around features, namely the red color component, green color component, Skewness, Kurtosis, and Variance, and on Gray Level Co-occurrence Matrix (GLCM) features, namely Contrast, Correlation, Energy, and Homogeneity.

The rest of this paper is organized as follows. Section II briefly describes related work. Section III presents the segmentation methodology, which uses watershed segmentation. Feature extraction and the classification of vegetables using decision trees are presented in Sections IV and V, respectively. Experimental results and analysis are given in Section VI, and conclusions are drawn in Section VII. The block diagram of the overall process is given in Fig. 1.

Fig. 1 Block diagram of overall process: Input Image → Segmentation [Watershed Segmentation] → Feature Extraction (Mean around Features, GLCM Features, Combined (Mean around-GLCM) Features) → Decision Tree Classifier for Classification → Labelled Image

II. LITERATURE SURVEY

In [18], an automated grading system for oil palm fruit is developed using the RGB color model and artificial fuzzy logic. The mean color intensity based on the RGB color model is determined, and the system achieves 86.67% accuracy over all categories. In [14], a methodology is presented for the recognition and classification of fruits in fruit salad image samples. Samples of different fruits, namely Apple, Chikku, Banana, Orange, and Pineapple, are considered. Each fruit sample is sliced into pieces and placed on a tray. The RGB color features extracted from the images form the knowledge base. A K-means classifier is proposed, with a classification efficiency of around 98%. In [12], an efficient fusion of color and texture features is proposed for fruit recognition. Recognition is performed by a minimum distance classifier based on statistical and co-occurrence features derived from the wavelet-transformed sub-bands. Experimental results on a database of about 2635 fruits from 15 different classes confirm the effectiveness of the approach. In [16], a fruit recognition system is proposed that combines three feature analysis methods, color-based, shape-based, and size-based, in order to increase recognition accuracy. The method classifies and recognizes fruit images from the obtained feature values using nearest neighbor classification, and the system then shows the fruit name and a short description to the user. The system analyzes, classifies, and identifies fruits with up to 90% accuracy. In [19], a method for color classification in yarn-dyed fabric is proposed. The color image of the yarn-dyed fabric is obtained with a flat scanner and converted from the RGB color space to the Lab color space. FCM is selected as the color clustering method, and the number of yarn colors is detected based on the validity of the FCM clusters. In [17], a method is presented to identify human skin regions in an image based on color classification. The method classifies the colors of all pixels in the image into several classes using the K-means algorithm and segments the image into several parts according to the color class to which each pixel belongs. It then finds the class whose feature vector has the minimum distance to a skin color feature vector previously defined in the color space. In [3], an image classification technique is presented that uses the Bayes decision rule for minimum cost to classify pixels into skin color and non-skin color. Color statistics are collected from the YCbCr color space. In [8], a method is presented to classify more than ten categories of seed defects using color and texture features and a support vector machine (SVM) classifier. In the image classification part, color histograms in the RGB and HSV color spaces, together with texture features based on the Gray Level Co-occurrence Matrix (GLCM) and Local Binary Pattern (LBP), are adopted as features. The system was evaluated on more than 10,000 sample images. The obtained accuracies are 95.6% for the normal seed type and 80.6% for the group of defect seed types.

III. SEGMENTATION

Image segmentation is a process that partitions an image into its constituent regions or objects. Effective segmentation of complex images is one of the most difficult tasks in image processing, and various segmentation algorithms have been proposed to achieve efficient and accurate results. Among these algorithms, watershed segmentation is particularly attractive. Its central idea is a topographic representation of image intensity. Watershed segmentation also embodies other principal image segmentation methods, including discontinuity detection, thresholding, and region processing. Because of these factors, watershed segmentation is more effective and stable than other segmentation algorithms [11]. Watershed segmentation is an effective method for segmenting gray level vegetable images. To apply it to binary images, we pre-process the binary vegetable images with a distance transform, which converts them to gray level images suitable for watershed segmentation. Common Distance Transforms (DTs) include Euclidean, City block, and Chessboard. Different DTs produce very different watershed segmentation results for the binary vegetable images. For vegetable images containing components of different shapes, we find that the Chessboard DT achieves better watershed segmentation results than the Euclidean DT and the City block DT.
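The difference between the Chessboard and City block distance transforms can be illustrated with a small sketch. It assumes NumPy and computes both transforms of a binary mask by repeated erosion; the function names `erode` and `distance_transform` are hypothetical, pixels outside the image are assumed to be background, and this is a brute-force illustration rather than the implementation used in this work. The watershed step itself, which would be run on the resulting gray level map, is not shown.

```python
import numpy as np


def erode(mask, connectivity):
    """One binary erosion step with a 4- or 8-connected neighbourhood."""
    rows, cols = mask.shape
    # Assumption: pixels outside the image count as background.
    padded = np.pad(mask, 1, constant_values=False)
    out = mask.copy()
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if connectivity == 4 and dr != 0 and dc != 0:
                continue  # city block ignores diagonal neighbours
            out &= padded[1 + dr:1 + dr + rows, 1 + dc:1 + dc + cols]
    return out


def distance_transform(mask, metric="chessboard"):
    """Distance from each foreground pixel to the nearest background pixel."""
    if metric not in ("chessboard", "cityblock"):
        raise ValueError("metric must be 'chessboard' or 'cityblock'")
    connectivity = 8 if metric == "chessboard" else 4
    current = np.asarray(mask, dtype=bool)
    dist = np.zeros(current.shape, dtype=int)
    while current.any():
        dist += current  # every surviving pixel lies one step deeper
        current = erode(current, connectivity)
    return dist
```

A pixel that touches the background only diagonally is at distance 1 under the Chessboard metric but at distance 2 under the City block metric, which is one reason the two transforms can lead to different watershed results.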

IV. FEATURE EXTRACTION

In the feature extraction process, we determine the distribution patterns of the data based on the red color component, green color component, Skewness, Kurtosis, and Variance around the mean of the sample data. We also determine the GLCM features Contrast, Correlation, Energy, and Homogeneity, which are based on pixel intensities. A confusion matrix is then determined for the Mean-around features, the GLCM features, and the combined (Mean around-GLCM) features.

A. Mean-around Features

Color components and texture features are the prominent features for classification of vegetables. Five distribution features of the vegetables are considered: Red component, Green component, Kurtosis, Variance, and Skewness.

The average value of the red component ($\bar{R}$) of an RGB image is obtained using equation (1).

$$\bar{R} = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} f_R(i, j) \qquad (1)$$

where $f_R$ is the red component of an RGB image and $M$ and $N$ are the number of rows and columns of the image.

The average value of the green component ($\bar{G}$) of an RGB image is obtained using equation (2).

$$\bar{G} = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} f_G(i, j) \qquad (2)$$

where $f_G$ is the green component of an RGB image and $M$ and $N$ are the number of rows and columns of the image.

Skewness is a measure of the asymmetry of the data around the sample mean. If skewness is negative, the data are spread out more to the left of the mean than to the right. If skewness is positive, the data are spread out more to the right. The skewness of the normal distribution (or any perfectly symmetric distribution) is zero. The skewness of a distribution is defined in equation (3).

$$S = \frac{E(x - \mu)^3}{\sigma^3} \qquad (3)$$

where $x$ is the pixel value, $\mu$ is the mean of $x$, $\sigma$ is the standard deviation of $x$, and $E(x - \mu)^3$ represents the expected value of the quantity $(x - \mu)^3$.




Kurtosis is a measure of how outlier-prone a distribution is. The kurtosis of the normal distribution is 3. Distributions that are more outlier-prone than the normal distribution have kurtosis greater than 3; distributions that are less outlier-prone have kurtosis less than 3. The kurtosis of a distribution is defined in equation (4).

$$K = \frac{E(x - \mu)^4}{\sigma^4} \qquad (4)$$

where $x$ is the pixel value, $\mu$ is the mean of $x$, $\sigma$ is the standard deviation of $x$, and $E(x - \mu)^4$ represents the expected value of the quantity $(x - \mu)^4$.

Variance measures the variation in the pixel values of an image. The variance of a distribution is defined in equation (5).

$$V = \sum_{i=1}^{M} \sum_{j=1}^{N} \left( f(i, j) - \bar{f} \right)^2 \qquad (5)$$

where $f$ is the gray level image and $\bar{f}$ is the average value of the pixels in the gray level image of the vegetable.
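The five Mean-around features can be computed along the following lines. This is a minimal sketch assuming NumPy and an image stored as an H x W x 3 array; the function name `mean_around_features` is hypothetical, and the gray level image is assumed here to be the unweighted mean of the three channels, which the paper does not specify. The variance is computed as the sum of squared deviations, following equation (5); dividing by the number of pixels gives the conventional variance.

```python
import numpy as np


def mean_around_features(rgb):
    """Return the five Mean-around features of an H x W x 3 RGB image."""
    rgb = np.asarray(rgb, dtype=float)
    mean_red = rgb[:, :, 0].mean()     # equation (1)
    mean_green = rgb[:, :, 1].mean()   # equation (2)

    # Assumption: gray level = unweighted mean of the R, G, B channels.
    x = rgb.mean(axis=2).ravel()
    mu = x.mean()
    sigma = x.std()                    # population standard deviation
    centred = x - mu

    # Note: a constant image has sigma == 0, so skewness and kurtosis
    # are undefined (NaN) in that case.
    skewness = np.mean(centred ** 3) / sigma ** 3   # equation (3)
    kurtosis = np.mean(centred ** 4) / sigma ** 4   # equation (4)
    variance = np.sum(centred ** 2)                 # equation (5), unnormalised

    return {
        "mean_red": mean_red,
        "mean_green": mean_green,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "variance": variance,
    }
```

The returned dictionary can be flattened into a feature vector in a fixed key order before it is passed to a classifier.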

B. GLCM Features

Texture features use the contents of the GLCM to measure the variation in intensity at a pixel of interest. Co-occurrence texture features were first proposed by Haralick et al. [6] in 1973; they characterize texture using a variety of quantities derived from second order image statistics. Co-occurrence texture features are extracted from an image in two steps. First, the pairwise spatial co-occurrences of pixels separated by a particular angle and distance are tabulated using the GLCM. Second, the GLCM is used to compute a set of scalar quantities that characterize different aspects of the underlying texture. The GLCM is a tabulation of how often different combinations of gray levels co-occur in an image or image section [6].

TABLE I
GLCM FEATURES

Contrast: $\sum_{i,j} |i - j|^2 \, p(i, j)$
Correlation: $\sum_{i,j} \frac{(i - \mu_i)(j - \mu_j) \, p(i, j)}{\sigma_i \sigma_j}$
Energy: $\sum_{i,j} p(i, j)^2$
Homogeneity: $\sum_{i,j} \frac{p(i, j)}{1 + |i - j|}$

The GLCM is an N x N square matrix, where N is the number of different gray levels in an image. An element $p(i, j, d, \theta)$ of the GLCM of an image represents the relative frequency with which a pixel p of gray level i at location (x, y) co-occurs with a pixel of gray level j located at a distance d from p in the orientation $\theta$. While GLCMs provide a quantitative description of a spatial pattern, they are too unwieldy for practical image analysis. Haralick et al. [6] therefore proposed a set of scalar quantities for summarizing the information contained in a GLCM. They originally proposed a total of fourteen features; however, only subsets of these are typically used [9]. The four derived features used in our work are given in TABLE I.
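The two-step extraction described above, tabulating the GLCM and then deriving Contrast, Correlation, Energy, and Homogeneity from it, can be sketched as follows. The sketch assumes NumPy and an image whose pixels are already quantized to integer gray levels 0 to N-1; the function names `glcm` and `glcm_features` are hypothetical, and the offset (0, 1) is an assumed choice corresponding to distance 1 at orientation 0 degrees, since the paper does not state the distance and angle used.

```python
import numpy as np


def glcm(image, levels, offset=(0, 1)):
    """Normalised co-occurrence matrix p(i, j) for one (row, col) offset."""
    image = np.asarray(image)
    dr, dc = offset
    rows, cols = image.shape
    counts = np.zeros((levels, levels), dtype=float)
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r, c], image[r2, c2]] += 1
    return counts / counts.sum()  # relative frequencies


def glcm_features(p):
    """Contrast, Correlation, Energy and Homogeneity of a normalised GLCM."""
    levels = p.shape[0]
    i, j = np.indices((levels, levels))
    mu_i = np.sum(i * p)
    mu_j = np.sum(j * p)
    sigma_i = np.sqrt(np.sum((i - mu_i) ** 2 * p))
    sigma_j = np.sqrt(np.sum((j - mu_j) ** 2 * p))
    return {
        "contrast": np.sum(np.abs(i - j) ** 2 * p),
        # Undefined (NaN) when the image has a single gray level.
        "correlation": np.sum((i - mu_i) * (j - mu_j) * p) / (sigma_i * sigma_j),
        "energy": np.sum(p ** 2),
        "homogeneity": np.sum(p / (1 + np.abs(i - j))),
    }
```

The explicit double loop keeps the tabulation step readable; for 300 x 300 images a vectorized or library implementation would be preferable in practice.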

V. DECISION TREE CLASSIFIER

Decision trees are easy to interpret, computationally inexpensive, and capable of coping with noisy data. They have therefore been widely used in various applications, such as pattern recognition [13], credit and loan evaluation [13], [7], fraud and network intrusion detection [15], [4], and medical diagnosis and healthcare management [10]. Decision tree learning, used in statistics, data mining, and machine learning, uses a decision tree as a predictive model that maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees. The majority of decision trees deal with the classification problem, which is also the main goal of this paper; in this context, the technique is also referred to as classification trees. In this paper, we deal with binary trees, where each split produces exactly two child nodes.

Three splitting rules that are widely available for growing a decision tree are gini, twoing, and entropy. Each splitting rule attempts to segregate the data using a different approach. The gini index is defined as:

$$\mathrm{Gini}(t) = \sum_{i} p_i (1 - p_i) \qquad (6)$$

where $p_i$ is the relative frequency of class $i$ at node $t$ (determined by dividing the number of observations of the class by the total number of observations), and node $t$ represents any node (parent or child) at which a given split of the data is performed [1]. The gini index is a measure of impurity for a given node and is at a maximum when all observations are equally distributed among all classes. In general terms, the gini splitting rule attempts to find the largest homogeneous category within the dataset and isolate it from the remaining data. Subsequent nodes are then segregated in the same manner until further divisions are not possible. An alternative measure of node impurity is the twoing index:

$$\mathrm{Twoing}(t) = \frac{P_L P_R}{4} \left( \sum_{i} \left| p(i \mid t_L) - p(i \mid t_R) \right| \right)^2 \qquad (7)$$

where $L$ and $R$ refer to the left and right sides of a given split respectively, $P_L$ and $P_R$ are the fractions of the observations at node $t$ sent to the left and right child nodes, and $p(i \mid t)$ is the relative frequency of class $i$ at node $t$ [2]. Twoing attempts to segregate the data more evenly than the gini rule, separating whole groups of data and identifying groups that make up 50 percent of the remaining data at each successive node. Entropy, often referred to as the information rule, is a measure of the homogeneity of a node and is defined as:

$$\mathrm{Entropy}(t) = -\sum_{i} p_i \log p_i \qquad (8)$$

where $p_i$ is the relative frequency of class $i$ at node $t$ [1]. The entropy rule attempts to identify splits where as many groups as possible are divided as precisely as possible, and it forms groups by minimizing the within-group diversity [5]. This rule can be interpreted as the expected value of the minimized negative log-likelihood of a given split result, and it tends to identify rare classes more accurately than the previous rules.
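The three splitting criteria in equations (6) to (8) can be written down directly. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, class frequencies are passed as plain lists, and the natural logarithm is assumed for the entropy since the paper does not state the base.

```python
import math


def gini(p):
    """Gini index of a node, equation (6); p holds class frequencies."""
    return sum(pi * (1.0 - pi) for pi in p)


def entropy(p):
    """Entropy of a node, equation (8), with the natural log (assumed base)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def twoing(p_left, p_right, frac_left):
    """Twoing value of a split, equation (7).

    p_left, p_right: class frequencies in the left and right child nodes.
    frac_left: fraction of the parent's observations sent to the left child.
    """
    frac_right = 1.0 - frac_left
    spread = sum(abs(l - r) for l, r in zip(p_left, p_right))
    return frac_left * frac_right / 4.0 * spread ** 2
```

For eight equally frequent classes, as in the most mixed possible node of this problem, `gini` returns 7/8, its maximum for eight classes, while a pure node gives 0 for both `gini` and `entropy`. Unlike the other two, the twoing value scores a split rather than a node, so larger values indicate better splits.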

a) Color Image b) Labeled Image c) Grayscale Image

Fig. 2 Sample Experimental Results

VI. RESULTS AND DISCUSSION

In this work, we created our own vegetable database. We collected vegetable images from the World Wide Web; in addition, some images were taken in and around our place using a Canon digital camera in natural daylight. All images approximately fill the camera field of view and were taken in natural daylight against a white background. Images were resized to 300 x 300 pixels to speed up computation. We considered the most commonly available vegetables. The vegetable database contains a total of 582 vegetable images in 8 classes. Fig. 2 shows sample experimental results.

We used decision trees for the classification of vegetables. The feature set contains the average red component, average green component, skewness, kurtosis, variance, and energy. The confusion matrices show the accuracy of the decision tree. On evaluating the training samples, we obtained the highest classification accuracy with the combined features.

TABLE II
CONFUSION MATRIX FOR GLCM FEATURES USING ENTROPY

             Cabbage  Beetroot  Capsicum  Carrot  Chillies  Cucumber  Bittermelon  Onion  Total  Success Rate in %
Cabbage           93         1         1       1         0         2            1      5    104  89.42
Beetroot           5        48         0       1         0         2            1      5     62  77.41
Capsicum           5         0        37       0         0         0            0      2     44  84.09
Carrot             2         2         1      60         0         2            0      0     67  89.55
Chillies           1         2         0       3        17         1            0      1     25  68.00
Cucumber           2         0         0       3         0        36            1      3     45  80.00
Bittermelon        1         2         1       3         0         0           46      1     54  85.18
Onion              2         1         0       1         0         1            0    176    181  97.23
Total                                                                                       582  88.14
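The success rates reported with each confusion matrix follow from its counts: the per-class rate is the diagonal entry divided by the row total, and the overall rate is the sum of the diagonal divided by the number of images. The sketch below, using only the Python standard library, recomputes them for the confusion matrix for GLCM features using entropy; the names `GLCM_ENTROPY` and `success_rates` are hypothetical.

```python
# Rows: Cabbage, Beetroot, Capsicum, Carrot, Chillies, Cucumber,
# Bittermelon, Onion (counts copied from the GLCM / entropy matrix).
GLCM_ENTROPY = [
    [93, 1, 1, 1, 0, 2, 1, 5],
    [5, 48, 0, 1, 0, 2, 1, 5],
    [5, 0, 37, 0, 0, 0, 0, 2],
    [2, 2, 1, 60, 0, 2, 0, 0],
    [1, 2, 0, 3, 17, 1, 0, 1],
    [2, 0, 0, 3, 0, 36, 1, 3],
    [1, 2, 1, 3, 0, 0, 46, 1],
    [2, 1, 0, 1, 0, 1, 0, 176],
]


def success_rates(matrix):
    """Return (per-class success rates in %, overall success rate in %)."""
    per_class = [100.0 * row[k] / sum(row) for k, row in enumerate(matrix)]
    correct = sum(row[k] for k, row in enumerate(matrix))
    total = sum(sum(row) for row in matrix)
    return per_class, 100.0 * correct / total
```

Calling `success_rates(GLCM_ENTROPY)` gives about 89.42% for Cabbage and about 88.14% overall (513 of 582 images), matching the tabulated values.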

TABLE III
CONFUSION MATRIX FOR MEAN AROUND FEATURES USING ENTROPY

             Cabbage  Beetroot  Capsicum  Carrot  Chillies  Cucumber  Bittermelon  Onion  Total  Success Rate in %
Cabbage           99         0         2       1         0         0            2      0    104  95.19
Beetroot           1        57         0       1         0         0            0      3     62  91.93
Capsicum           0         0        42       0         1         0            0      1     44  95.45
Carrot             0         2         3      59         0         0            0      3     67  88.05
Chillies           0         0         0       1        23         0            0      1     25  92.00
Cucumber           1         0         0       0         0        44            0      0     45  97.77
Bittermelon        4         2         1       0         0         2           45      0     54  83.33
Onion              0         0         1       0         1         1            0    178    181  98.34
Total                                                                                       582  93.98

TABLE IV
CONFUSION MATRIX FOR COMBINED FEATURES USING ENTROPY

             Cabbage  Beetroot  Capsicum  Carrot  Chillies  Cucumber  Bittermelon  Onion  Total  Success Rate in %
Cabbage          103         0         0       0         0         0            1      0    104  99.03
Beetroot           0        57         1       1         0         0            1      2     62  91.93
Capsicum           0         0        38       1         0         0            2      3     44  86.36
Carrot             1         0         1      63         1         0            1      0     67  94.02
Chillies           0         0         0       1        24         0            0      0     25  96.00
Cucumber           1         0         0       0         0        43            1      0     45  95.55
Bittermelon        1         0         0       1         0         0           49      3     54  90.74
Onion              0         0         1       0         0         0            0    180    181  99.44
Total                                                                                       582  95.70

Fig. 3: Estimated cost for each tree using cross validation for splitting rule entropy (x-axis: tree size, number of terminal nodes; y-axis: misclassification rate)
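The cross-validated misclassification rate underlying such cost estimates can be sketched as follows, using only the Python standard library. The number of folds is not stated in the paper, so k = 10 is an assumption; the function names are hypothetical, and `majority_classifier` is only a placeholder learner standing in for the decision tree. The sketch estimates the cost of a single model, not the cost for each tree size in a pruning sequence as plotted in the figure.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def majority_classifier(train_x, train_y):
    """Placeholder learner: always predicts the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return lambda sample: most_common


def cross_validated_error(samples, labels, train_fn, k=10, seed=0):
    """Misclassification rate estimated by k-fold cross validation.

    train_fn(train_x, train_y) must return a function mapping one sample
    to a predicted label.
    """
    errors = 0
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train_x = [s for i, s in enumerate(samples) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        predict = train_fn(train_x, train_y)
        errors += sum(predict(samples[i]) != labels[i] for i in held_out)
    return errors / len(samples)
```

Each image is held out exactly once, so the returned value is the fraction of the whole data set that was misclassified by a model that did not see it during training.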

TABLE V
CONFUSION MATRIX FOR GLCM FEATURES USING GDI

             Cabbage  Beetroot  Capsicum  Carrot  Chillies  Cucumber  Bittermelon  Onion  Total  Success Rate in %
Cabbage           96         1         1       1         1         1            1      2    104  92.30
Beetroot           5        50         1       2         0         0            2      2     62  80.64
Capsicum           4         0        35       0         0         2            0      3     44  79.54
Carrot             2         1         1      54         1         4            2      2     67  80.59
Chillies           0         0         2       2        19         1            0      1     25  76.00
Cucumber           3         1         0       2         0        35            0      4     45  77.77
Bittermelon        2         3         0       2         0         2           44      1     54  81.48
Onion              6         0         1       0         0         0            0    174    181  96.13
Total                                                                                       582  87.11

TABLE VI
CONFUSION MATRIX FOR MEAN AROUND FEATURES USING GDI

             Cabbage  Beetroot  Capsicum  Carrot  Chillies  Cucumber  Bittermelon  Onion  Total  Success Rate in %
Cabbage          101         1         0       0         0         0            1      1    104  97.11
Beetroot           2        49         1       0         1         1            2      6     62  79.03
Capsicum           2         0        38       0         0         1            2      1     44  86.36
Carrot             0         0         0      64         0         0            3      0     67  95.52
Chillies           1         0         2       1        18         0            0      3     25  72.00
Cucumber           1         0         0       0         0        43            1      0     45  95.55
Bittermelon        1         0         0       0         0         0           52      1     54  96.29
Onion              0         1         3       0         0         0            0    177    181  97.79
Total                                                                                       582  93.12

TABLE VII
CONFUSION MATRIX FOR COMBINED FEATURES USING GDI

             Cabbage  Beetroot  Capsicum  Carrot  Chillies  Cucumber  Bittermelon  Onion  Total  Success Rate in %
Cabbage           98         1         1       1         0         1            0      2    104  94.23
Beetroot           0        59         1       0         0         0            1      1     62  95.16
Capsicum           1         0        38       0         0         0            3      2     44  86.36
Carrot             0         1         2      62         0         0            2      0     67  92.53
Chillies           0         1         1       2        18         0            0      3     25  72.00
Cucumber           1         0         0       0         0        43            1      0     45  95.55
Bittermelon        0         0         0       0         0         1           53      0     54  98.14
Onion              1         2         1       0         0         0            0    177    181  97.79
Total                                                                                       582  94.15


Fig. 4: Estimated cost for each tree using cross validation for splitting rule gdi (y-axis: misclassification rate)

TABLE VIII
CONFUSION MATRIX FOR GLCM FEATURES USING TWOING RULE

             Cabbage  Beetroot  Capsicum  Carrot  Chillies  Cucumber  Bittermelon  Onion  Total  Success Rate in %
Cabbage           96         2         0       0         0         0            2      4    104  92.30
Beetroot           2        43         0       3         1         4            5      4     62  69.35
Capsicum           1         0        30       1         1         4            1      6     44  68.18
Carrot             1         1         1      55         0         5            1      3     67  82.08
Chillies           1         0         1       1        20         1            0      1     25  80.00
Cucumber           4         1         0       3         0        37            0      0     45  82.22
Bittermelon        0         2         1       1         0         0           45      5     54  83.33
Onion              9         1         2       1         1         1            1    165    181  91.16
Total                                                                                       582  84.36

TABLE IX
CONFUSION MATRIX FOR MEAN AROUND FEATURES USING TWOING RULE

             Cabbage  Beetroot  Capsicum  Carrot  Chillies  Cucumber  Bittermelon  Onion  Total  Success Rate in %
Cabbage          103         0         0       0         0         0            1      0    104  99.03
Beetroot           4        54         0       0         1         0            0      3     62  87.09
Capsicum           2         0        39       0         0         0            0      3     44  88.63
Carrot             2         2         0      61         0         1            1      0     67  91.04
Chillies           1         0         0       3        20         0            0      1     25  80.00
Cucumber           0         0         0       1         0        44            0      0     45  97.77
Bittermelon        2         0         0       0         0         0           52      0     54  96.29
Onion              4         0         1       1         0         1            0    174    181  96.13
Total                                                                                       582  93.98

TABLE X
CONFUSION MATRIX FOR COMBINED FEATURES USING TWOING RULE

             Cabbage  Beetroot  Capsicum  Carrot  Chillies  Cucumber  Bittermelon  Onion  Total  Success Rate in %
Cabbage          100         0         3       0         0         0            1      0    104  96.15
Beetroot           0        59         1       0         1         1            0      0     62  95.16
Capsicum           0         0        40       1         0         2            0      1     44  90.90
Carrot             1         3         1      59         0         2            1      0     67  88.05
Chillies           1         0         1       2        20         0            0      1     25  80.00
Cucumber           2         0         0       0         0        43            0      0     45  95.55
Bittermelon        1         0         1       0         0         0           52      0     54  96.29
Onion              4         0         1       0         0         0            1    175    181  96.68
Total                                                                                       582  94.15



Fig. 5: Estimated cost for each tree using cross validation for splitting rule Twoing (y-axis: misclassification rate)

VII. CONCLUSIONS

In this paper, we used watershed segmentation to segment the vegetable images in the dataset. Mean Around features and GLCM features were extracted from the segmented region. Testing was conducted on Mean Around features, GLCM features, and combined features using a decision tree classifier with the tree splitting rules gdi, twoing, and entropy. Testing used cross validation and led to the following observations:

Splitting rule entropy for growing decision tree:

The GLCM features have given a success rate of 88.14%.

Mean Around features have given a success rate of 93.98%.

Mean Around-GLCM features have given a success rate of 95.70%.

Splitting rule gdi for growing decision tree:

The GLCM features have given a success rate of 87.11%.

Mean Around features have given a success rate of 93.12%.

Mean Around-GLCM features have given a success rate of 94.15%.

Splitting rule twoing for growing decision tree:

The GLCM features have given a success rate of 84.36%.

Mean Around features have given a success rate of 93.98%.

Mean Around-GLCM features have given a success rate of 94.15%.

The experimental results reveal that combining Mean Around features and GLCM features increases the accuracy of classification. This method can be extended to the classification of other objects, such as flowers, fruits, seeds, and other vegetables, where human intervention is currently needed for classification.

ACKNOWLEDGMENT

The authors would like to thank Sandeep Kumar K.S and Shiva Kumar G for their help.

REFERENCES

[1] Apte C. and Weiss S., Data mining with decision trees and decision rules, Future Generation Computer Systems, 13:197-210, 1997.

[2] Breiman L., Some properties of splitting criteria, Machine Learning, 24:41-47, 1996.

[3] Chai D., Bouzerdoum A., A Bayesian approach to skin color classification in YCbCr color space, Proceedings TENCON, Vol. 2, 421-424, 2000.

[4] D. Zhu, G. Premkumar, X. Zhang, C. H. Chu, Data mining for network intrusion detection: a comparison of alternative methods, Decision Sciences, 32 (4), 635-660, 2001.

[5] De'ath G., Fabricius K. E., Classification and regression trees: a powerful yet simple technique for ecological data analysis, Ecology, 81(11):3178-3192, 2000.

[6] Haralick R. M., Shanmugam K. and Dinstein I., Textural Features for image classification, IEEE Trans. on Systems, Man and Cybernetics, 610-621, 1973.

[7] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, CA, 1993.

[8] Kiratiratanapruk K., Sinthupinyo W., Color and texture for corn seed classification by machine vision, International Symposium on Intelligent Signal Processing and Communications Systems (ISPACS), 1-5, 2011.

[9] Newsam S. D. and Kamath C., Retrieval using texture features in high resolution multi-spectral satellite imagery, In SPIE Conference on Data Mining and Knowledge Discovery, 2004.

[10] Olivia R. L. Sheng, Chih P. Wei, Paul J. H. Hu and Namsik C., Automated learning of patient image retrieval knowledge: neural networks versus inductive decision trees, Decision Support Systems, 30 (2), 105-124, 2000.

[11] Rafael C. G. and Richard E. Woods and Steven L. Eddins, Digital Image Processing using MATLAB, PPH, 2009.



[12] S. Arivazhagan, R. Newlin Shebiah, S. Selva Nidhyanandhan, L. Ganesan, Fruit Recognition using Color and Texture Features, Journal of Emerging Trends in Computing and Information Sciences, Vol. 1, No. 2, 90-94, Oct 2010.

[13] S. Piramuthu, On learning to predict web traffic, Decision Support Systems, 35 (2), 213-229, 2003.

[14] Vishwanath B. C., S. A. Madival, Sharanbasava Madole, Recognition of Fruits in Fruits Salad Based on Color and Texture Features, International Journal of Engineering Research & Technology, Vol. 1, Issue 7, 1-6, September 2012.

[15] W. Lee, S. J. Stolfo, A framework for constructing features and models for intrusion detection systems, ACM Trans. on Information and System Security, 3 (4), 227-261, 2000.

[16] Woo Chaw Seng, Seyed Hadi Mirisaee, A new method for fruits recognition system, International Conference on Electrical Engineering and Informatics, Vol. 1, 130-134, 2009.

[17] Xiaoying Fang, Wenquan Gu, Chang Huang, A method of skin color identification based on color classification, International Conference on Computer Science and Network Technology (ICCSNT), Vol. 4, 2355-2358, 2011.

[18] Z. May, M. H. Amaran, Automated Oil Palm Fruit Grading System using Artificial Intelligence, International Journal of Video & Image Processing and Network Security, 30-35, 2011.

[19] Zhang Ronghua, Chen Hongwu, Zhang Xiaoting, Pan Ruru, Liu Jihong, Unsupervised Color Classification for Yarn-dyed Fabric Based on FCM Algorithm, International Conference on Artificial Intelligence and Computational Intelligence (AICI), Vol. 1, 497-501, 2010.
