© 2015, IJARCSSE All Rights Reserved Page | 751
Volume 5, Issue 6, June 2015 ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com
Association Rule Mining for Ground Water and Wastelands Using Apriori Algorithm: Case Study of Jodhpur District
Mainaz Faridi*, Department of Computer Science, Banasthali University, India
Seema Verma, Department of Electronics, Banasthali University, India
Saurabh Mukherjee, Department of Computer Science, Banasthali University, India
Abstract— Advances in data collection and storage techniques have made it possible to collect and store terabytes of data on a daily basis. This large volume of data hides meaningful and interesting information that needs to be brought to light, which has made data mining one of the most actively researched domains of recent years. The primary goal of data mining is to uncover non-trivial, previously unknown and hidden information in large data repositories and data warehouses. Data mining applied to spatial data sets is called Spatial Data Mining or Geographic Data Mining; it can be used to characterize spatial data, interrelate spatial and non-spatial data, and reveal hidden spatial patterns. Data mining offers many methods for discovering previously unseen patterns and trends, such as clustering, classification, prediction, regression, outlier detection and association rule mining. In this paper, the authors mine association rules between ground water and wastelands using spatial data mining techniques. Salt-affected waste lands and waste lands without scrub that show a higher ground water level underneath can be irrigated using this water, thereby increasing the area under cultivation.
Keywords— Spatial Data Mining, Association Rule Mining, Apriori Algorithm, Wastelands, Ground Water.
I. INTRODUCTION
WRIS and BOOSAMPDA are two major projects run by ISRO (Indian Space Research Organization) and NRSC (National Remote Sensing Centre). They provide country-wide information on ground water and on land cover across India, respectively, in the form of maps, and produce a huge amount of data related to ground water and land cover [1]. The tremendous volume of numeric and geospatial data stored in different formats, databases and data repositories calls for a wide range of tools and techniques to analyze and query the data, uncover patterns, or even predict phenomena in cases where human intelligence alone is not sufficient [2]. New technologies and methods are needed to explore these large databases for hidden and implicit knowledge, special patterns, or correlations between spatial and non-spatial attributes [3]. Recent research on knowledge discovery in large spatial databases has laid a foundation for spatial data mining techniques.
A. Spatial data mining
Spatial data mining, i.e. the discovery of interesting, implicit knowledge in spatial databases, provides a means for understanding and using spatial data and knowledge bases. It is also referred to as Geographical Data Mining [4] and Knowledge Discovery in Spatial Databases [5]. The main difference from conventional data mining is that spatial data mining tasks use spatial attributes in addition to the non-spatial attributes that are usual in non-spatial data. Traditional data mining assumes little or no dependence between the studied variables and lacks the ability to correlate non-spatial attributes with spatial information [6]. Spatial data mining is the process of finding useful and interesting patterns that are hidden in large spatial datasets.
Revealing interesting and potentially useful patterns in large spatial datasets is much more complex than extracting the corresponding patterns from conventional numeric and categorical data sets. The complexity of spatial data types, spatial relationships and the autocorrelation of spatial attributes account for this difficulty [7].
B. Association Rule Mining using Apriori Algorithm
Association Rule Mining (ARM) is one of the most extensively used and studied techniques of data mining, with a wide range of application areas. The most common example is market basket analysis, where associations between different consumer products are identified to support business and marketing decisions. Other application domains that provide large data sets to which ARM can be applied are finance, insurance, banking, fraud detection, medicine, bioinformatics, demographic studies, telecommunication, GIS, remote sensing, e-commerce and retailing. More recently, association rule mining has also been applied to areas such as pharmaceutics, law and justice, aviation management, agriculture and weather forecasting.
Let database D contain N transactions, and let X and Y be disjoint itemsets, i.e. their intersection is empty (X ∩ Y = ∅). An association rule is written in the form X → Y, where X is the antecedent (left-hand side of the rule) and Y is the consequent (right-hand side). Both the antecedent and the consequent may contain more than one item. The strength and reliability of an association rule are measured by two factors: support and confidence.
Faridi et al., International Journal of Advanced Research in Computer Science and Software Engineering 5(6),
June- 2015, pp. 751-758
Support (prevalence) is the percentage of database transactions that contain both X and Y; it can be viewed as the probability that X and Y occur together, P(X ∪ Y). With σ(·) denoting the support count of an itemset, i.e. the number of transactions that contain it, and N the total number of transactions, the support s of the rule X → Y is calculated as:
$$s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{N} \qquad (1)$$
Confidence (predictability) is the percentage of database transactions containing X that also contain Y. In other words, it can be seen as the conditional probability P(Y|X). It is calculated as:
$$c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)} \qquad (2)$$
Support gives the rule statistical significance: if it is too low, the rule may have occurred merely by chance. Confidence, on the other hand, measures the reliability or predictability of the rule: if it is high, one can infer with good reliability that Y is present in transactions containing X. Therefore, to select only the rules of high interestingness, thresholds called minsup and minconf are set on the support and confidence values, respectively. Generally a low minsup and a high minconf are chosen so that interesting rules are not missed. Association rules are mined in two phases. In the first phase (frequent itemset generation), all itemsets whose support meets the minsup threshold are found; such itemsets are called frequent itemsets. In the second phase (rule generation), the rules that satisfy the minconf threshold are extracted from the frequent itemsets [8].
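The two measures can be illustrated with a minimal Python sketch. The function names and the toy transactions below are assumptions made for illustration only; they are not data or code from the study.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support_count(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset` (sigma)."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(x, y, transactions) -> float:
    """Equation (1): sigma(X u Y) / N."""
    return support_count(set(x) | set(y), transactions) / len(transactions)


def confidence(x, y, transactions) -> float:
    """Equation (2): sigma(X u Y) / sigma(X). Assumes X occurs at least once."""
    return support_count(set(x) | set(y), transactions) / support_count(x, transactions)


# Hypothetical toy database (not data from the study).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

s = support({"bread"}, {"milk"}, transactions)     # 2 of the 4 transactions
c = confidence({"bread"}, {"milk"}, transactions)  # 2 of the 3 containing bread
print(f"support={s:.2f}, confidence={c:.2f}")
```

With minsup = 0.15 and minconf = 0.9, the thresholds used later in this paper, the toy rule bread → milk would pass the support test but fail the confidence test.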
1) Apriori Algorithm: Many algorithms have been proposed for association rule mining, but the most prominent remains the Apriori algorithm, proposed by Agrawal et al. in 1994 [9]. It is still widely studied many years after its introduction. Many improvements and extensions have been proposed for it, yet its applicability to many areas remains to be exploited.
The Apriori algorithm works on the principle of the downward closure property, also called the anti-monotone property. Generating frequent itemsets by searching all possible itemsets requires scanning the whole database. The anti-monotone property is used to reduce the number of candidate itemsets during frequent itemset generation. It states that if an itemset is frequent then all its subsets are also frequent or, equivalently, that if an itemset is not frequent then none of its supersets is frequent. Let P be the power set of the set of items. Following [8], a measure f is anti-monotone if
∀ X, Y ∈ P: (X ⊆ Y) → f(Y) ≤ f(X).
The Apriori algorithm searches the candidate itemsets breadth-first. It uses frequent itemsets of length k-1 to generate candidate itemsets of length k (join step). It then uses the anti-monotone property to discard candidates that have an infrequent subset (prune step) and counts the remaining candidates against the database to obtain the frequent itemsets. Association rules are generated from each frequent itemset Y in the form X → Y − X, where X is a non-empty proper subset of Y. Rules whose confidence does not satisfy the minconf threshold are dropped, and only the remaining strong rules are kept.
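The rule generation phase can be sketched as follows. This is a simplified illustration, not the procedure used by Weka: `generate_rules` and the support counts are hypothetical, and the sketch assumes the count of every subset of a frequent itemset is available, which downward closure guarantees.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]
Rule = Tuple[Itemset, Itemset, float]


def generate_rules(counts: Dict[Itemset, int], minconf: float) -> List[Rule]:
    """Return rules (X, Y - X, confidence) with confidence >= minconf.

    `counts` maps every frequent itemset to its support count. By downward
    closure it also contains every non-empty subset of each frequent itemset.
    """
    rules: List[Rule] = []
    for itemset, count in counts.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty body and a non-empty head
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                x = frozenset(antecedent)
                conf = count / counts[x]  # sigma(Y) / sigma(X)
                if conf >= minconf:
                    rules.append((x, itemset - x, conf))
    return rules


# Hypothetical support counts for two items a and b (illustrative only).
counts = {
    frozenset({"a"}): 4,
    frozenset({"b"}): 2,
    frozenset({"a", "b"}): 2,
}
rules = generate_rules(counts, minconf=0.9)
for x, y, conf in rules:
    print(sorted(x), "->", sorted(y), f"conf={conf:.2f}")
```

Here b → a is kept because every transaction containing b also contains a, while a → b is dropped because its confidence is only 0.5.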
2) Pseudocode: The pseudocode for the algorithm is as follows:
ALGORITHM. Apriori
Input: D, a database of transactions; minsup, the minimum support count threshold.
Output: ⋃k Lk, the frequent itemsets in D.
L1 = {frequent 1-itemsets};
for (k = 2; Lk-1 != ∅; k++) {
    Ck = candidates generated from Lk-1
    // that is, the join Lk-1 x Lk-1, eliminating any candidate that has
    // a (k-1)-size subset that is not frequent
    for each transaction t in database D do {
        // increment the count of all candidates in Ck that are contained in t
    } // end for each
    Lk = candidates in Ck with count >= minsup
} // end for
return ⋃k Lk;
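The pseudocode above can be turned into a short runnable Python sketch. It is an unoptimized illustration under stated assumptions: the function name `apriori`, the simple pairwise join and the toy database are choices made here for clarity and are not the Weka implementation used in the study.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

Itemset = FrozenSet[str]


def apriori(transactions: List[Itemset], minsup: int) -> Dict[Itemset, int]:
    """Return every frequent itemset with its support count.

    `minsup` is a minimum support *count*, as in the pseudocode.
    """
    # L1: count single items and keep the frequent ones.
    counts: Dict[Itemset, int] = {}
    for t in transactions:
        for item in t:
            key = frozenset({item})
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= minsup}
    frequent = dict(level)

    k = 2
    while level:
        prev = set(level)
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        # Count step: one pass over the database per level.
        level_counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    level_counts[c] += 1
        level = {c: n for c, n in level_counts.items() if n >= minsup}
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical toy database (not data from the study).
db = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
    frozenset({"a", "b", "c"}),
]
result = apriori(db, minsup=3)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

In this toy run all three single items and all three pairs are frequent, while the triple {a, b, c} survives the prune step but is rejected after counting because it occurs in only two transactions.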
II. AIM AND OBJECTIVES
Land and water are undoubtedly the two major natural resources essential for the very existence of life. With the increase in population, the demand for land has risen manyfold. The objective of the study is therefore to find barren lands that have a substantial ground water level, so that these lands can be used for the cultivation of crops and of fodder for animals. The study aims to unearth association rules between ground water and wastelands of Jodhpur District. The outcomes are expected to reveal useful patterns that relate ground water to wastelands.
III. RESEARCH METHODOLOGY
A. Study Area
Jodhpur district lies in the arid zone of Rajasthan, between 25°51′08″ and 27°37′09″ North latitude and 71°48′09″ and 73°52′06″ East longitude. It covers 11.60% of the total arid area of the state. Jodhpur district, part of Jodhpur Division, covers a geographical area of 2256405 hectares and is divided into 5 sub-divisions: Jodhpur, Shergarh, Pipar City, Osian and Phalodi. The district has 07 tehsils and 09 blocks. It is bounded by Bikaner in the north, Nagaur in the east, Jaisalmer in the west, and Barmer and Pali in the south.
Fig. 1. Map showing study area location.
B. Data Collection
The study required information about land use, ground water and soil in the study area in GIS format. The data were collected from the Indian Space Research Organization (ISRO) Jodhpur Center, which provided land use, ground water and soil data for Jodhpur district for the year 2005 in GIS format.
The datasets used in this study and their basic characteristics are briefly described as follows:
1) Land Use Data of Jodhpur District: The land use map of Jodhpur divides the land into Agricultural Land, Built-up, Forest, Waste-land, Water bodies and Wetlands.
2) Ground Water Data of Jodhpur District: Jodhpur District is classified into regions according to the level and quality of ground water, viz. Good, Good but saline, Good to Moderate, Moderate, Moderate to Poor, Poor, Poor to Nil, Saline, Settlement, Very Good to Good and Water Body mask.
C. Tools/Software Used
ArcMap 10 is used for creating thematic maps and overlays. Weka 3.6 is used for generating association rules.
D. Methods
The methodology developed for this study is shown in figure 2. Each block represents a processing step on the way to the final output.
Fig. 2. Overall approach of the study.
1) Pre-processing of Data: The spatial datasets are preprocessed to create a transactional database before association rule mining can be applied. The preprocessing of spatial data may include selection of non-spatial attributes, feature selection, dimension reduction, join, union or intersection operations, data categorization etc. [10]. The study required two different data sets, one for ground water and one for waste lands. The pre-processing of the data was carried out in three steps:
a. A thematic layer with the required attributes is created for the waste land data.
b. A thematic layer with the required attributes is created for the ground water data.
c. An intersection is performed on the waste land and ground water layers, and a new thematic layer is created that shows those areas of Jodhpur district which are either salt-affected waste lands or waste lands without scrub and have good ground water beneath.
The details of the above pre-processing steps are as follows:
a. Thematic Layers for Waste Land
The land use data of Jodhpur district, as provided by the ISRO center, Jodhpur, classifies land use into the following types: Agricultural Land, Built-up, Forest, Waste-land, Water bodies and Wetlands. Table I shows the land use pattern in order of decreasing area and figure 3 shows the land use map.
Table I: Land Use Pattern
Land Type | Area (Hectares)
Agriculture | 1940925.7
Waste-lands | 675378.7
Built-up | 29594
Water bodies | 20406.8
Forest | 14164.8
Wetlands | 6110.4
Fig. 3. Land Use map of Jodhpur District
Out of all the land classes above, the study focuses on waste-lands only. Therefore, to obtain the waste-land distribution pattern, a new thematic layer showing only waste lands is prepared. Figure 4 and table II show this layer. The waste lands are further classified into Sandy-desertic Land, Salt Affected Land, Mining/Industrial waste, Land without scrub, Land with scrub, Gullied/Ravenous Land and Barren Rocky/Stony waste.
Table II: Waste Land Pattern
Waste-land Type | Area (Hectares)
Sandy-desertic Land | 213737.9
Land without scrub | 155328.8
Land with scrub | 154027
Barren Rocky/Stony waste | 141733.4
Mining/Industrial waste | 4017.7
Salt Affected Land | 3716.7
Gullied/Ravenous Land | 2816.9
Fig. 4. Waste Land distribution of Jodhpur District
Among all the types of waste land, only those that are either salt affected or without scrub are chosen for further study. The reason is that the other types either already contain some vegetation (Land with scrub) or are not suitable for growing any type of vegetation (Sandy-desertic Land, Mining/Industrial waste, Gullied/Ravenous Land, Barren Rocky/Stony waste). Therefore, a new thematic layer for "Land Without Scrub" and "Salt Affected Waste Land" is created. Figure 5 and table III show this layer.
Table III: Waste Land (Salt affected/ Without Scrub)
Wasteland | Area (Hectares)
Land without scrub | 155328.80
Salt Affected Land | 3716.73
Fig. 5. Waste Land (Salt affected/Without Scrub) distribution.
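The attribute selection behind Table III amounts to filtering the waste-land classes of Table II by name. A minimal Python sketch of that step is given below; the helper `select_classes` is hypothetical, and in the study the selection was performed on GIS layers in ArcMap rather than on a table of areas.

```python
# Areas in hectares, copied from Table II.
waste_land_area = {
    "Sandy-desertic Land": 213737.9,
    "Land without scrub": 155328.8,
    "Land with scrub": 154027.0,
    "Barren Rocky/Stony waste": 141733.4,
    "Mining/Industrial waste": 4017.7,
    "Salt Affected Land": 3716.7,
    "Gullied/Ravenous Land": 2816.9,
}

# Classes retained for further study, as stated in the text.
selected_classes = {"Land without scrub", "Salt Affected Land"}


def select_classes(areas, keep):
    """Return only the entries whose class name is in `keep`."""
    return {name: area for name, area in areas.items() if name in keep}


selected = select_classes(waste_land_area, selected_classes)
selected_total = sum(selected.values())
share = selected_total / sum(waste_land_area.values())
print(f"selected area: {selected_total:.1f} ha ({share:.1%} of all waste land)")
```

On these figures the two retained classes cover roughly 159,000 hectares, a little under a quarter of the waste land in the district.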
Thus, the above process can be summarized as:
Fig. 6. Thematic layers of Land use data
b. Thematic Layers for Ground Water
The ground water data, as provided by the ISRO Center, Jodhpur, is classified into the types Good, Good but saline, Good to Moderate, Moderate, Moderate to Poor, Poor, Poor to Nil, Saline, Settlement, Very Good to Good and Water Body mask. Jodhpur District is divided into regions according to this classification. The distribution of ground water is shown in figure 7 and table IV.
Table IV: Ground water Pattern
Ground Water | Area (Hectares)
Good | 40115.58
Good but Saline | 11168.37
Good to moderate | 27975.06
Moderate | 582198.82
Moderate to Poor | 1460483.16
Poor | 313362.70
Poor to Nil | 98935.60
Saline | 3345.65
Settlement | 31270.99
Very good to good | 266028.01
Water Body Mask | 21649.42
Fig. 7. Ground water distribution of Jodhpur.
Out of these regions, only those having Good, Good but saline, Good to Moderate or Very Good to Good ground water are selected. As the next step, a new thematic layer for ground water is created containing only the selected classes, as shown in figure 8 and table V.
Table V: Good ground water Pattern
Ground Water | Area (Hectares)
Good | 40115.58
Good but Saline | 11168.37
Good to moderate | 27975.07
Very good to good | 266028.01
Fig. 8. Good ground water distribution of Jodhpur District.
Thus, the above process can be summarized as shown in figure 9:
Fig. 9. Thematic layers of Ground water data.
c. Overlays and Intersection of Thematic Layers
As the next step, an overlay map of waste lands (salt affected and without scrub) and good ground water is created. An overlay operation is much more than a simple merging of linework: all the attributes of the features taking part in the overlay are carried through. This is shown in figure 10, where wastelands (polygons) and good ground water (polygons) are overlaid to create a new polygon layer.
Fig. 10. Overlay Map of Wasteland (Salt affected/ Without Scrub) and Good Ground Water.
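The idea that an overlay carries the attributes of both input layers into the output can be sketched in a few lines of Python. This is a deliberately simplified illustration: axis-aligned rectangles stand in for the real polygons, the `Feature` class and both functions are hypothetical, and the coordinates are arbitrary. In the study the overlay was performed in ArcMap.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Feature:
    """Axis-aligned rectangle (a stand-in for a polygon) with attributes."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    attrs: Dict[str, str]

    def area(self) -> float:
        return (self.xmax - self.xmin) * (self.ymax - self.ymin)


def intersect(a: Feature, b: Feature) -> Optional[Feature]:
    """Overlapping part of two features, carrying both attribute sets."""
    xmin, ymin = max(a.xmin, b.xmin), max(a.ymin, b.ymin)
    xmax, ymax = min(a.xmax, b.xmax), min(a.ymax, b.ymax)
    if xmin >= xmax or ymin >= ymax:
        return None  # no overlap, or only a shared edge
    return Feature(xmin, ymin, xmax, ymax, {**a.attrs, **b.attrs})


def overlay_intersection(layer_a: List[Feature], layer_b: List[Feature]) -> List[Feature]:
    """Intersect every feature of one layer with every feature of the other."""
    result = []
    for a in layer_a:
        for b in layer_b:
            piece = intersect(a, b)
            if piece is not None:
                result.append(piece)
    return result


# Hypothetical layers in arbitrary map units (not the study's geometry).
waste = [Feature(0, 0, 4, 4, {"WasteLandType": "Land without scrub"})]
water = [
    Feature(2, 2, 6, 6, {"GroundWaterType": "Good"}),
    Feature(10, 10, 12, 12, {"GroundWaterType": "Very good to good"}),
]
pieces = overlay_intersection(waste, water)
for piece in pieces:
    print(piece.attrs, piece.area())
```

Each output feature has both a WasteLandType and a GroundWaterType value, which is what allows every intersected polygon to be treated as one transaction in the later rule mining step.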
A new layer is then created, using intersection, for those areas of the district where waste lands that are salt affected or without scrub have good ground water beneath. The newly constructed layer is shown in figure 11. Table VI shows the area under the mining pattern.
Table VI: Area under mining pattern.
Waste Land | Area (Hectares)
Land without scrub | 13308.98
Salt Affected Land | 329.96
Total | 13638.94
Fig. 11. Intersect Map of Wastelands (Salt affected/Without Scrub) and Good Ground Water.
2) Association Rules Generation: Weka 3.6 is used to generate the association rules. The database file obtained from the above map (figure 11) is converted into ARFF format, on which association rules are generated using the Apriori algorithm.
IV. RESULTS AND DISCUSSION
The Apriori algorithm was run in Weka on the ARFF file created after the preprocessing of the data. Three attributes from the database file were chosen as predicates, viz. Taluk, WasteLandType and GroundWaterType. From a total of 285 instances, 6 itemsets of size 1, 7 itemsets of size 2 and 2 itemsets of size 3 were discovered in 17 cycles. The minimum support and minimum confidence were set to 15% (0.15) and 90% (0.9), respectively. Tables VII, VIII and IX show the large itemsets found in the data.
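For readers unfamiliar with ARFF, the conversion step can be sketched as below. The function `to_arff`, the relation name and the two sample records are hypothetical; only the three attribute names and the category labels come from the paper. The sketch assumes all attributes are nominal and that no value contains a quote character.

```python
from typing import Dict, List


def to_arff(relation: str, records: List[Dict[str, str]], attributes: List[str]) -> str:
    """Serialize nominal records to ARFF text (header plus data section)."""
    def quote(value: str) -> str:
        # Quote values so that embedded spaces are preserved.
        return "'" + value + "'"

    lines = ["@relation " + relation, ""]
    for name in attributes:
        # A nominal attribute lists every distinct value it can take.
        values = sorted({r[name] for r in records})
        nominal = ",".join(quote(v) for v in values)
        lines.append("@attribute " + name + " {" + nominal + "}")
    lines += ["", "@data"]
    for r in records:
        lines.append(",".join(quote(r[name]) for name in attributes))
    return "\n".join(lines) + "\n"


# Hypothetical records; the real file had 285 instances.
records = [
    {"Taluk": "Bilara", "WasteLandType": "Land without scrub",
     "GroundWaterType": "Very good to good"},
    {"Taluk": "Phalodi", "WasteLandType": "Salt Affected Land",
     "GroundWaterType": "Good"},
]
arff = to_arff("wasteland_groundwater", records,
               ["Taluk", "WasteLandType", "GroundWaterType"])
print(arff)
```

Each row of the data section corresponds to one intersected polygon, so every instance is a transaction with exactly one value per attribute.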
Table VII. Large Itemsets L(1)
Item 1 | Count
Taluk=Bilara | 99
Taluk=Jodhpur | 76
Taluk=Phalodi | 61
WasteLandType=Land without scrub | 280
GroundWaterType=Very good to good | 181
GroundWaterType=Good | 44
Table VIII. Large Itemsets L(2)
Item 1 | Item 2 | Count
Taluk=Bilara | WasteLandType=Land without scrub | 99
Taluk=Bilara | GroundWaterType=Very good to good | 81
Taluk=Jodhpur | WasteLandType=Land without scrub | 76
Taluk=Jodhpur | GroundWaterType=Very good to good | 75
Taluk=Phalodi | WasteLandType=Land without scrub | 57
WasteLandType=Land without scrub | GroundWaterType=Very good to good | 181
WasteLandType=Land without scrub | GroundWaterType=Good | 44
Table IX. Large Itemsets L(3)
Item 1 | Item 2 | Item 3 | Count
Taluk=Bilara | WasteLandType=Land without scrub | GroundWaterType=Very good to good | 81
Taluk=Jodhpur | WasteLandType=Land without scrub | GroundWaterType=Very good to good | 75
The best rules found after applying Apriori algorithm are listed in the table X below.
Table X. Association Rules Mined for Ground Water and Waste Lands of Jodhpur District.
S.No. | Body | Body Count | Implies | Head | Support | Conf %
1. | GroundWaterType=Very good to good | | ==> | WasteLandType=Land without scrub | 81 | 100
2. | Taluk=Bilara | | ==> | WasteLandType=Land without scrub | 99 | 100
3. | Taluk=Bilara, GroundWaterType=Very good to good | 81 | ==> | WasteLandType=Land without scrub | 81 | 100
4. | Taluk=Jodhpur | 76 | ==> | WasteLandType=Land without scrub | 76 | 100
5. | Taluk=Jodhpur, GroundWaterType=Very good to good | 75 | ==> | WasteLandType=Land without scrub | 75 | 100
6. | GroundWaterType=Good | 44 | ==> | WasteLandType=Land without scrub | 44 | 100
7. | Taluk=Jodhpur | 76 | ==> | GroundWaterType=Very good to good | 76 | 99
8. | Taluk=Jodhpur, WasteLandType=Land without scrub | 76 | ==> | GroundWaterType=Very good to good | 76 | 99
9. | Taluk=Jodhpur | 76 | ==> | WasteLandType=Land without scrub, GroundWaterType=Very good to good | 76 | 99
10. | Taluk=Phalodi | 61 | ==> | WasteLandType=Land without scrub | 61 | 93
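As a cross-check, the confidence percentages reported for rules 7 and 10 can be recomputed with equation (2) from the itemset counts in Tables VII and VIII. The short Python sketch below does this; the dictionary layout and function names are illustrative assumptions, while the counts and the instance total of 285 are taken from the paper.

```python
# Support counts copied from Tables VII and VIII.
N = 285  # total number of instances
count = {
    frozenset({"Taluk=Jodhpur"}): 76,
    frozenset({"Taluk=Phalodi"}): 61,
    frozenset({"Taluk=Jodhpur", "GroundWaterType=Very good to good"}): 75,
    frozenset({"Taluk=Phalodi", "WasteLandType=Land without scrub"}): 57,
}


def rule_confidence(body: set, head: set) -> float:
    """Equation (2): count of body and head together over count of body."""
    return count[frozenset(body | head)] / count[frozenset(body)]


def rule_support(body: set, head: set) -> float:
    """Equation (1): count of body and head together over N."""
    return count[frozenset(body | head)] / N


conf_rule7 = rule_confidence({"Taluk=Jodhpur"}, {"GroundWaterType=Very good to good"})
conf_rule10 = rule_confidence({"Taluk=Phalodi"}, {"WasteLandType=Land without scrub"})
sup_rule10 = rule_support({"Taluk=Phalodi"}, {"WasteLandType=Land without scrub"})

print(f"rule 7 confidence:  {conf_rule7:.1%}")
print(f"rule 10 confidence: {conf_rule10:.1%}")
print(f"rule 10 support:    {sup_rule10:.1%}")
```

The recomputed values, 75/76 and 57/61, round to the 99% and 93% listed in Table X, and the relative support of rule 10 (57/285) lies above the 15% minimum support.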
The results show that 13638.94 hectares of land fall under the mined pattern. An analysis of the results is shown as a graph in figure 12. It shows that Bilara has the largest area (6481.05 hectares) of waste lands matching the mined pattern. The mined area is substantial and can be utilized for growing vegetation using the water underneath. The same results were obtained by implementing the WEKA Apriori algorithm in the authors' own Java code.
Fig. 12. Graph showing distribution of Wastelands in taluks of Jodhpur District.
V. CONCLUSION
The analysis of the mined pattern shows that the majority of wastelands without scrub that have very good ground water lie in the Bilara region of Jodhpur District. With a good amount of water underneath, these lands can be used to produce firewood and fodder for animals. Plant species such as Acacia jacquemontii, Acacia leucophloea, Acacia senegal, Albizia lebbeck, Azadirachta indica, Anogeissus rotundifolia, Prosopis cineraria, Salvadora oleoides, Tecomella undulata, Tamarix articulata, Leucaena leucocephala, Tephrosia purpurea and Crotalaria medicaginea can be grown. Farmers can be advised to cultivate crops using ground water irrigation: if a piece of land is known to have a good ground water level, it can be irrigated using this water. Even where the water underneath is saline, salt-resistant plant species can be grown. In this way waste-lands can be utilized effectively.
VI. FUTURE WORK
A wide variety of research is being carried out in the field of spatial data mining.
As the next level of this research, fuzzy spatial association rules could be determined.
Soil and crop data could also be used along with the ground water and wasteland data.
Spatio-temporal association rules could also be determined as an extension of the current research.
Hence, much research remains to be carried out in these emerging areas, focusing on their applicability to agriculture, data mining and GIS, which will provide means for better utilization of natural resources.
ACKNOWLEDGMENT
The authors would like to thank ISRO, Jodhpur Centre for providing necessary data about the research scenario.
REFERENCES
[1] Mainaz Faridi, Seema Verma and Saurabh Mukherjee. 2012. Impact of ground water level and its quality on
fertility of land using GIS and Agriculture Business Intelligence. In Proceedings of Geomatrix’12- An
International Conference on Geospatial Technologies and Applications, IIT Bombay (Feb 2012).
[2] Yuan, May, B. Buttenfield, M. Gahegan, and Harvey Miller. 2004. Geospatial data mining and knowledge
discovery. Chapter 14 (2004): 365-388.
[3] Krzysztof Koperski, and Jiawei Han. 1995. Discovery of spatial association rules in geographic information
databases. Advances in spatial databases, Springer Berlin Heidelberg. vol 6, 47-66.
[4] Stan Openshaw. 1999. Geographical data mining: key design issues. In Proceedings of GeoComputation, vol.
99.
[5] Krzysztof Koperski, Jiawei Han, and Nebojsa Stefanovic. 1998. An efficient two-step method for classification
of spatial data. In Proceedings of International Symposium on Spatial Data Handling (SDH 1998), Vancouver,
BC, Canada. 45-54.
[6] Hong Tang and Simon McDonald. 2002. Integrating GIS and spatial data mining technique for target marketing
of university courses. In ISPRS Commission IV, Symposium, Ottawa Canada, (Jul 2002).
[7] D. Rajesh. 2011. Application of Spatial Data Mining for Agriculture. International Journal of Computer
Applications 15,2 (2011), 7-9.
[8] Tan, Pang-Ning, and Vipin Kumar. 2005. Chapter 6. Association Analysis: Basic Concepts and Algorithms. In Introduction to Data Mining. Addison-Wesley. ISBN 321321367 (2005).
[9] Rakesh Agrawal, and Ramakrishnan Srikant. 1994. Fast algorithms for mining association rules. In
Proceedings of 20th int. conf. very large data bases, VLDB, (1994), vol. 1215, 487-499.
[10] Chen, Junming, Guangfa Lin, and Zhihai Yang. 2011. Extracting spatial association rules from the maximum
frequent itemsets based on Boolean matrix. In Geoinformatics, 2011 19th International Conference on, IEEE
(2011), 1-5.