Spatial Data Mining
• Data mining is a process of knowledge discovery related to finding patterns of interest within a large data set.
Examples of databases where spatial data mining is useful:
Earth observation satellites (terabytes per day)
Census data
Weather systems
Marketing
and so on.
Examples of Spatial Patterns
• Historic examples
– 1855 Asiatic cholera in London: a water pump identified as the source
• Modern examples
– Cancer clusters to investigate environmental health hazards
– Crime hotspots for planning police patrol routes
– Bald eagles nest on tall trees near open water
– Unusual warming of the Pacific Ocean (El Niño)
Spatial Data Mining
• Data mining is a combination of processes:
– Data extraction
– Data cleaning
– Selection of characteristics
– Algorithms
– Analysis of results
• An important characteristic to exploit in spatial data mining: similar objects tend to be spatially close.
Data Mining: Process
[Figure: the data mining loop. A problem leads to a hypothesis; data mining algorithms (association, clustering, classification) run against the DB through OGIS SQL; results go through verification, refinement, and visualization; the expert analyst adjusts the technique and feeds the interpretation back into action.]
Statistics versus Data Mining
• Data mining is strongly related to statistical analysis.
• Data mining can be seen as a filter (exploratory data analysis) applied before a rigorous statistical tool.
• Data mining generates hypotheses that are then verified.
• The filtering process does not guarantee completeness (patterns may be wrongly eliminated or missed).
A Classification of Data Mining Processes
• The three most common data mining processes are:
– Association rules: determination of interactions between attributes. For example: X → Y.
– Classification: estimation of an attribute of an entity in terms of attribute values of another entity. Some applications are:
• Predicting locations (shopping centers, habitat, crime zones)
• Thematic classification (satellite images)
– Clustering: a form of unsupervised learning, where the classes and the number of classes are unknown.
Association Rules
• A spatial association rule is a rule indicating an association relationship among a set of spatial and possibly some non-spatial predicates.
Example: “Most big cities in Canada are close to the Canada–U.S. border.”
• A strong rule indicates that the patterns in the rule occur relatively frequently in the database and that the implication relationship is strong.
Association Rules
• Spatial association rules are defined in terms of spatial predicates:
P1 ∧ P2 ∧ … ∧ Pn → Q1 ∧ Q2 ∧ … ∧ Qm (s%, c%)
For example:
is_a(x, country) ∧ close(x, Mediterranean) → is(x, wine-exporter) (s%, c%)
where, for a rule i1 → i2:
s% (support): i1 and i2 occur together in at least s% of the cases;
c% (confidence): among all cases where i1 occurs, i2 also occurs at least c% of the time.
Association Rules: Apriori
• Principle: if an itemset has high support, then so do all its subsets.
• The steps of the algorithm are as follows:
– first, discover all frequent 1-itemsets
– combine them to form candidate 2-itemsets and keep the frequent ones
– continue until no more itemsets exceed the threshold
– finally, search for rules
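The level-wise steps above can be sketched in Python. The function name and the simple join-based candidate generation are illustrative, not the original implementation; the transactions are the five-item example (D, A, T, V, C) used on the next slides.

```python
def apriori(cases, min_support):
    """Level-wise frequent-itemset discovery (Apriori sketch)."""
    n = len(cases)
    items = sorted({i for t in cases for i in t})
    frequent = {}          # frozenset(itemset) -> absolute count
    level = []
    # Level 1: frequent 1-itemsets
    for i in items:
        count = sum(1 for t in cases if i in t)
        if count / n >= min_support:
            frequent[frozenset([i])] = count
            level.append(frozenset([i]))
    k = 2
    while level:
        # Join frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = []
        for c in candidates:
            count = sum(1 for t in cases if c <= t)
            if count / n >= min_support:
                frequent[c] = count
                level.append(c)
        k += 1
    return frequent

cases = [{"D", "A", "V", "C"}, {"A", "T", "C"}, {"D", "A", "V", "C"},
         {"D", "A", "T", "C"}, {"D", "A", "T", "V", "C"}, {"A", "T", "V"}]
```

With min_support = 50%, this recovers the frequent itemsets tabulated on the following slides (e.g. A in all 6 cases, AC in 5, DAVC in 3), while DT (2 of 6 cases) is pruned.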
Association Rules: Apriori
Items: D = CD player, A = alarm, T = TV, V = VCR, C = computer.
Cases:
1: D A V C
2: A T C
3: D A V C
4: D A T C
5: D A T V C
6: A T V
Frequency of itemsets:
100% (6): A
83% (5): C, AC
67% (4): D, T, V, DA, DC, AT, AV, DAC
50% (3): DV, TC, VC, DAV, DVC, ATC, AVC, DAVC
Association rules with confidence = 100%:
D → A (4/4), D → C (4/4), D → AC (4/4), T → A (4/4), V → A (4/4), C → A (5/5),
DC → A (4/4), DV → A (3/3), VC → A (3/3), DVC → A (3/3), AVC → D (3/3), DAV → C (3/3)
Association rules with confidence >= 80%:
C → D (4/5), A → C (5/6), C → DA (4/5)
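The confidence figures above can be checked directly against the six cases: confidence(X → Y) is the support of X ∪ Y divided by the support of X. A minimal sketch (the `confidence` helper is illustrative):

```python
def confidence(cases, antecedent, consequent):
    """confidence(X -> Y) = support(X and Y) / support(X)."""
    x = set(antecedent)
    xy = x | set(consequent)
    n_x = sum(1 for t in cases if x <= t)    # cases containing X
    n_xy = sum(1 for t in cases if xy <= t)  # cases containing X and Y
    return n_xy / n_x

cases = [{"D", "A", "V", "C"}, {"A", "T", "C"}, {"D", "A", "V", "C"},
         {"D", "A", "T", "C"}, {"D", "A", "T", "V", "C"}, {"A", "T", "V"}]
```

For example, D → A holds in 4 of the 4 cases containing D (confidence 100%), while A → C holds in 5 of the 6 cases containing A (83%).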
Association rules
• Differences with respect to the spatial domain:
– The notion of transaction or case does not exist, since the data are embedded in a continuous space. Partitioning the space may over- or underestimate confidences.
– Itemsets are smaller in the spatial domain, so the cost of generating candidates is not the dominant factor; the enumeration of neighbors dominates the computational cost.
– In most cases, spatial items are discretized versions of continuous variables.
– The notion of transaction is replaced by neighborhood.
Example GeoMiner query:
discover spatial association rules
inside British Columbia
from road R, water W, mines M, boundary B
in relevance to town T
where g_close_to(T.geo, X.geo) and X in {R, W, M, B}
and T.type = “large”
and R.type in {divided_highway}
and W.type in {sea, ocean, large_lake, large_river}
and B.admin_region_1 in “B.C.” and B.admin_region_2 in “U.S.A.”
Discovery of Spatial Association Rules (cont.)
Note: “close_to” is a condition-dependent predicate defined by a set of knowledge rules.
For example, the following rule states that X is close_to Y if X is a town, Y is a country, and their distance is within 80 km:
Rule: close_to(X, Y) ← is_a(X, town) ∧ is_a(Y, country) ∧ dist(X, Y, d) ∧ d ≤ 80 km
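The knowledge rule reads as a simple predicate. A minimal sketch, assuming objects are represented as dicts with a `type` field and that the pairwise distance is supplied in km (both representation choices are assumptions, not part of the original rule syntax):

```python
def close_to(x, y, dist_km):
    """close_to(X, Y) holds if X is a town, Y is a country,
    and their distance is within 80 km."""
    return (x["type"] == "town"
            and y["type"] == "country"
            and dist_km <= 80)
```
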
Discovery of Spatial Association Rules (cont.)
Hierarchy for data relations: concept hierarchy for water (levels 1 to 4):
Water (level 1)
– River (level 2)
– – Large river (level 3)
– – – Fraser river (level 4)
– – Small river (level 3)
– Sea (level 2)
– Lake (level 2)
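A concept hierarchy like the one for water can be encoded as child-to-parent links, so the mining process can generalize a concrete concept upward level by level. A sketch (the dictionary encoding and helper name are assumptions):

```python
# Child -> parent encoding of the water concept hierarchy.
PARENT = {
    "river": "water", "sea": "water", "lake": "water",
    "large_river": "river", "small_river": "river",
    "fraser_river": "large_river",
}

def ancestors(concept):
    """Return the chain of more general concepts, most specific first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain
```
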
Hierarchy of topological relations:
[Figure: g_close_to is the most general predicate, covering close_to and not_disjoint; not_disjoint in turn covers intersects, adjacent_to, equal, inside / covered_by, and contains / covers.]
Step 1: Task_relevant_DB := extract_task_relevant_objects(SDB, RDB);
(1) Towns: only large towns
(2) Roads: only divided highways
(3) Water: only seas, oceans, large lakes and large rivers
(4) Mines: any mines
(5) Boundary: only the boundary of B.C. and U.S.A.
The set of relevant data is retrieved by executing the data retrieval methods of the data mining query, which extract the data sets whose spatial portion is inside B.C., as specified by the clauses:
... g_close_to(T.geo, X.geo) and X in {R, W, M, B} and T.type = “large” and R.type in {divided_highway} and W.type in {sea, ocean, large_lake, large_river} and B.admin_region_1 in “B.C.” and B.admin_region_2 in “U.S.A.” ...
Discovery of Spatial Association Rules (cont.)
Step 2: Coarse_predicate_DB := coarse_spatial_computation(Task_relevant_DB);
Town       | Water               | Road                  | Boundary | Mine
Victoria   | Juan_de_Fuca_Strait | Highway_1, Highway_17 | US       |
Saanich    | Juan_de_Fuca_Strait | Highway_1, Highway_17 | US       |
Prince     |                     | Highway_97            |          |
Pentincton | Okanagan_Lake       | Highway_97            | US       | Allala
…          | …                   | …                     | …        | …
At this level we can already mine:
is_a(X, large_town) → g_close_to(X, water) (80%)
is_a(X, large_town) ∧ g_close_to(X, sea) → g_close_to(X, us_boundary) (92%)
The “generalized close to” (g_close_to) relationship between (large) towns and the other four classes of entities is computed at a relatively coarse resolution level [using MBR data structures, R*-trees and other approximations; more on this later].
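The coarse g_close_to test operates on minimum bounding rectangles rather than exact geometries. A minimal sketch, assuming MBRs are given as (xmin, ymin, xmax, ymax) tuples and a distance threshold in the same units (the function name is illustrative):

```python
def mbr_g_close_to(a, b, threshold):
    """Coarse filter: two MBRs are g_close_to if the gap between
    the rectangles is within the threshold (0 if they overlap)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    dx = max(bx0 - ax1, ax0 - bx1, 0)  # horizontal gap, 0 when overlapping
    dy = max(by0 - ay1, ay0 - by1, 0)  # vertical gap, 0 when overlapping
    return (dx * dx + dy * dy) ** 0.5 <= threshold
```

Objects that pass this cheap filter are kept for the refined (exact-geometry) computation in later steps; objects that fail it can never satisfy the concrete predicates.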
Discovery of Spatial Association Rules (cont.)
Step 3: Large_Coarse_predicate_DB := filtering_with_min_support(Coarse_predicate_DB);
Town       | Water                              | Road                                              | Boundary
Victoria   | <adjacent_to, Juan_de_Fuca_Strait> | <intersects, Highway_1>, <intersects, Highway_17> | <close_to, US>
Saanich    | <adjacent_to, Juan_de_Fuca_Strait> | <intersects, Highway_1>, <close_to, Highway_17>   | <close_to, US>
Prince     |                                    | <intersects, Highway_97>                          |
Pentincton | <adjacent_to, Okanagan_Lake>       | <intersects, Highway_97>                          | <close_to, US>
…          | …                                  | …                                                 | …
Refined computation is performed on the large predicate sets, i.e., those retained in the g_close_to table. Each g_close_to predicate is replaced by one or a set of concrete predicates such as intersects, adjacent_to, close_to, inside, etc.
Discovery of Spatial Association Rules (cont.)
Step 4: Fine_predicate_DB := refined_spatial_computation(Large_Coarse_predicate_DB);
Level 1 (min support = 20):
k | Large k-predicate set                                                | count
1 | <adjacent_to, water>                                                 | 32
1 | <intersects, highway>                                                | 29
1 | <close_to, highway>                                                  | 29
1 | <close_to, us_boundary>                                              | 28
2 | <intersects, highway>, <close_to, us_boundary>                       | 26
2 | <adjacent_to, water>, <intersects, highway>                          | 25
2 | <adjacent_to, water>, <close_to, us_boundary>                        | 23
3 | <adjacent_to, water>, <intersects, highway>, <close_to, us_boundary> | 22
The level-by-level computation of large predicates and the corresponding association rules proceeds as follows: the computation starts at the topmost concept level and computes the large predicates at that level.
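The per-level filtering with minimum support is a simple threshold pass over the counted predicate sets. A sketch, with counts taken from the level-1 table above (the infrequent <adjacent_to, mine> row is a made-up illustration of what gets pruned):

```python
def filter_with_min_support(predicate_sets, min_support):
    """Keep only the k-predicate sets whose count meets the
    minimum support for the current concept level."""
    return {ps: c for ps, c in predicate_sets.items() if c >= min_support}

level1 = {
    frozenset([("adjacent_to", "water")]): 32,
    frozenset([("intersects", "highway")]): 29,
    frozenset([("adjacent_to", "water"), ("intersects", "highway")]): 25,
    frozenset([("adjacent_to", "mine")]): 4,  # hypothetical infrequent set
}
```

With min support 20, the hypothetical mine predicate is dropped and the three frequent sets survive to the rule-mining step.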
Discovery of Spatial Association Rules (cont.)
Step 5: Find_large_predicates_and_mine_rules(Fine_predicate_DB);
Level 2 (min support = 10):
k | Large k-predicate set                                                       | count
1 | <close_to, us_boundary>                                                     | 28
1 | <close_to, provincial_highway>                                              | 24
1 | <adjacent_to, sea>                                                          | 21
1 | <intersects, provincial_highway>                                            | 21
1 | <adjacent_to, large_river>                                                  | 11
2 | <close_to, us_boundary>, <close_to, provincial_highway>                     | 22
2 | <close_to, us_boundary>, <intersects, provincial_highway>                   | 19
2 | <adjacent_to, sea>, <close_to, us_boundary>                                 | 15
2 | <adjacent_to, sea>, <close_to, provincial_highway>                          | 11
3 | <adjacent_to, sea>, <close_to, provincial_highway>, <close_to, us_boundary> | 10
After mining rules at the highest level of the concept hierarchy, large k-predicate sets can be computed in the same way at the lower concept levels, which results in the following tables.
Discovery of Spatial Association Rules (cont.)
Level 3 (min support = 7):
k | Large k-predicate set                                    | count
1 | <close_to, us_boundary>                                  | 28
1 | <adjacent_to, fraser_river>                              | 10
1 | <adjacent_to, georgia_straight>                          | 9
2 | <adjacent_to, georgia_straight>, <close_to, us_boundary> | 7
The mining process stops at the lowest level of the hierarchies, or when an empty large 1-predicate set is derived.
A rule example:
is_a(X, large_town) ∧ adjacent_to(X, sea) → close_to(X, us_boundary) (100%)
Classification and Regression
• Classification:
– constructs a model (classifier) based on the training set and uses it to classify new data
– Example: climate classification, …
• Regression:
– models continuous-valued functions, i.e., predicts unknown or missing values
– Example: stock trend prediction, …
Classification
• Definition: f : D → L
where D is the domain of f, i.e., the domain of attribute values, and L is the set of labels or classes. For example, in a problem of bird habitat, D is a three-dimensional space: longevity of the vegetation, depth of water, and distance to water. L has two possible values: nest and no nest.
• The goal is to find a good f.
Classification (1): Model Construction
Training data:
NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no
Classification algorithms produce the classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification (2): Prediction Using the Model
The classifier is applied to testing data and to unseen data.
For example, the unseen tuple (Jeff, Professor, 4): Tenured?
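Applying the learned model to an unseen tuple is a straight rule evaluation. A sketch of the slide's classifier (the function name is illustrative):

```python
def classify(rank, years):
    """Model from the training set:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"
```

For the unseen tuple (Jeff, Professor, 4), the rank test fires, so the model predicts tenured = yes.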
Classification Techniques
• Decision tree induction
• Bayesian classification
• Neural networks
• Genetic algorithms
• Fuzzy sets and logic
Regression
• Regression is similar to classification:
– First, construct a model
– Second, use the model to predict unknown values
• Methods:
– Linear and multiple regression
– Non-linear regression
• Regression differs from classification:
– Classification predicts categorical class labels
– Regression models continuous-valued functions
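With a single predictor, linear regression reduces to two closed-form coefficients. A minimal ordinary-least-squares sketch (the helper name and sample data are illustrative):

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b
```

Fitting points that lie exactly on y = 1 + 2x recovers intercept 1 and slope 2; on noisy data, the same formulas give the least-squares line.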
Predicting Location Using Similarity Map
• Given:
– S, a set of locations {s1, …, sn} in a geographic space G.
– A collection of explanatory functions xk : S → Rk, where Rk, k = 1..K, is the range of possible values of the k-th explanatory function.
– A dependent class variable c : S → C = {c1, …, cM}.
– A value for the parameter α, the relative importance of spatial accuracy.
• Find: a classification model ĉ : R1 × … × RK → C.
• Objective: maximize
similarity(map(ĉ), map(c)) = (1 − α) · classification_accuracy(ĉ, c) + α · spatial_accuracy(ĉ, c)
Predicting Locations Using Similarity Map
• Constraints:
– The geographic space S is Euclidean.
– The values of the explanatory functions x1, …, xK and of the dependent class variable c may depend on the neighbors’ values (spatial autocorrelation).
– The domain Rk of each explanatory function is a domain of real numbers.
– The domain of the dependent variable is C = {0, 1}.
• Two distinguishing characteristics:
– Spatial autocorrelation
– The objective function combines spatial and classification accuracy.
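The combined objective can be sketched on a 1-D grid of cells labelled 1 (nest) / 0 (no nest). The slides leave the spatial-accuracy measure abstract, so crediting a predicted nest cell when an actual nest lies in the same or an adjacent cell is an assumption of this sketch:

```python
def combined_objective(actual, predicted, alpha):
    """(1 - alpha) * classification accuracy + alpha * spatial accuracy,
    on a 1-D grid of 0/1 cell labels. Spatial accuracy here (an
    assumption): fraction of predicted-1 cells with an actual 1 in the
    same or an adjacent cell."""
    n = len(actual)
    class_acc = sum(a == p for a, p in zip(actual, predicted)) / n
    pred_cells = [i for i, p in enumerate(predicted) if p == 1]
    if pred_cells:
        hits = sum(any(0 <= i + d < n and actual[i + d] == 1
                       for d in (-1, 0, 1))
                   for i in pred_cells)
        spat_acc = hits / len(pred_cells)
    else:
        spat_acc = 0.0
    return (1 - alpha) * class_acc + alpha * spat_acc
```

A prediction that is off by one cell scores poorly on classification accuracy but well on spatial accuracy, which is exactly the trade-off α controls.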
Clustering
• It is the process of finding groups without knowing in advance the number or the labels of the groups.
• Example: the counties in Chile can be clustered based on 4 attributes:
– Unemployment rate
– Population
– Per capita income
– Life expectancy
• Two types of clustering with different objectives are:
– Identify the central cities and their regions of influence by means of the variance of the attribute values within the space.
– Identify areas of the space where an attribute is homogeneous.
Clustering
• Definition 1:
– Given a set S = {s1, …, sn} of spatial objects (e.g., points) and a real-valued non-spatial attribute φ evaluated over S (φ : S → R).
– Find two disjoint subsets of S, C = {c1, …, ck} and NC = S − C = {nc1, …, ncl}, with k < n.
– Goal: min over C ⊆ S of ∑(j=1..l) | φ(ncj) − ∑(i=1..k) φ(ci) / dist(ncj, ci)² |²
– where dist(a, b) is the Euclidean distance or some other distance measure.
– Constraints:
• The influence of a center decreases with the square of the distance.
• There is at most one non-spatial attribute.
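Definition 1's objective can be evaluated directly for a candidate set of centers. A sketch assuming 2-D points, an attribute dict φ, and the squared-distance decay stated in the constraints (names are illustrative):

```python
def center_influence_error(points, attr, centers):
    """Objective of Definition 1: for each non-center point, compare
    its attribute value with the summed influence of the centers,
    where influence decays with the squared distance."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    err = 0.0
    for p in points:
        if p in centers:
            continue
        influence = sum(attr[c] / dist2(p, c) for c in centers)
        err += (attr[p] - influence) ** 2
    return err
```

A search over candidate center sets C would minimize this error; the sketch only scores one candidate.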
Clustering
• Definition 2:
– Given a set S = {s1, …, sn} of spatial objects, a set of real-valued non-spatial attributes φi, with i = 1..I, defined over S (φi : S → R), and a neighborhood structure E on S.
– Find K subsets Ck ⊆ S, with k = 1..K, such that:
– Goal: min over {Ck} of ∑(k) ∑(si, sj ∈ Ck) dist(F(si), F(sj)) + ∑(i, j) nbddist(Ci, Cj)
– where F is the cross product of the φi’s, i = 1..I; dist(a, b) is the distance measure; and nbddist(C, D) is the number of pairs of neighbors in E with one point in C and the other in D, i.e., pairs of neighbors mapped to different clusters.
– Constraints: |Ck| > 1 for all k = 1..K
Clustering: categories
• Hierarchical methods: starting with one cluster, successive partitions are made until a criterion is satisfied. These algorithms produce a tree of clusters called a dendrogram.
• Partitional methods: start from an initial assignment of the data to clusters and then reallocate data among the clusters until a criterion is satisfied. These methods tend to find clusters of spherical shape.
• Density-based methods: find clusters based on the density of points in a region.
• Grid-based methods: partition the space into cells and then perform the required operations on the quantized space. Cells that contain many points are considered dense and are connected to create clusters.
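The partitional category can be illustrated with a small k-means sketch: each point is reallocated to its nearest centroid and the centroids are recomputed, here for a fixed number of iterations for simplicity (a real implementation would iterate to convergence):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Partitional clustering sketch: reallocate 2-D points to the
    nearest of k centroids, then recompute the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                  + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters
```

On two well-separated blobs of three points each, the reallocation settles into the two natural groups, the spherical-cluster bias mentioned above.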
Clustering in SDB
• The idea is to make use of the indexing: if the SDB is large, not all the points will fit in main memory.
• For example, for an algorithm that requires n initial points to represent n clusters, a natural idea is to incorporate the notion of containment into the index definition in order to find the closest objects.
• A method that finds centroids of subdivisions of the space is the Voronoi diagram.