Spatial Data Mining
• Data mining is a process of knowledge discovery related to finding patterns of interest within a large data set.
Examples of databases where spatial data mining is useful:
Earth observation satellites (terabytes per day)
Census data
Weather systems
Marketing
and so on.
Examples of Spatial Patterns
• Historic examples
– 1855 Asiatic cholera in London: a water pump identified as the source
• Modern examples
– Cancer clusters to investigate environmental health hazards
– Crime hotspots for planning police patrol routes
– Bald eagles nest on tall trees near open water
– Unusual warming of the Pacific Ocean (El Niño)
Spatial Data Mining
• Data mining is a combination of processes:
– Data extraction
– Data cleaning
– Selection of characteristics
– Algorithms
– Analysis of results
• An important characteristic to exploit in spatial data mining: similar objects tend to be spatially close.
Data Mining: Process
[Figure: the data mining loop. A problem leads to a hypothesis; data mining algorithms (association, clustering, classification) run against the DB through OGIS SQL; results go through verification, refinement, and visualization; the expert analyst adjusts the technique and feeds the interpretation back into action.]
Statistics versus Data Mining
• Data mining is strongly related to statistical analysis.
• Data mining can be seen as a filter (exploratory data analysis) applied before a rigorous statistical tool.
• Data mining generates hypotheses that are then verified.
• The filtering process does not guarantee completeness (patterns may be wrongly eliminated or missed).
A Classification of Data Mining Processes
• The three most common data mining processes are:
– Association rules: determination of interactions between attributes. For example: X → Y.
– Classification: estimation of an attribute of an entity in terms of attribute values of another entity. Some applications are:
• Predicting locations (shopping centers, habitat, crime zones)
• Thematic classification (satellite images)
– Clustering: a form of unsupervised learning, where the classes and the number of classes are unknown.
Association Rules
• A spatial association rule is a rule indicating an association relationship among a set of spatial and possibly some non-spatial predicates.
Example: “Most big cities in Canada are close to the Canada–U.S. border.”
• A strong rule indicates that the patterns in the rule occur relatively frequently in the database and that the implication relationship is strong.
Association Rules
• Spatial association rules are defined in terms of spatial predicates:
P1 ∧ P2 ∧ … ∧ Pn → Q1 ∧ Q2 ∧ … ∧ Qm (s%, c%)
For example:
is_a(x, country) ∧ close(x, Mediterranean) → is(x, wine-exporter) (s%, c%)
where, for a rule i1 → i2:
s% (support): i1 and i2 occur together in at least s% of the cases;
c% (confidence): among all cases where i1 occurs, i2 also occurs at least c% of the time.
Association Rules: Apriori
• Principle: if an itemset has high support, then so do all its subsets.
• The steps of the algorithm are as follows:
– first, discover all frequent 1-itemsets
– combine them to form candidate 2-itemsets and keep the frequent ones
– continue until no more itemsets exceed the threshold
– finally, search for rules
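The level-wise steps above can be sketched in Python. The function name and the simple join-based candidate generation are illustrative, not the original implementation; the transactions are the five-item example (D, A, T, V, C) used on the next slides.

```python
def apriori(cases, min_support):
    """Level-wise frequent-itemset discovery (Apriori sketch)."""
    n = len(cases)
    items = sorted({i for t in cases for i in t})
    frequent = {}          # frozenset(itemset) -> absolute count
    level = []
    # Level 1: frequent 1-itemsets
    for i in items:
        count = sum(1 for t in cases if i in t)
        if count / n >= min_support:
            frequent[frozenset([i])] = count
            level.append(frozenset([i]))
    k = 2
    while level:
        # Join frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = []
        for c in candidates:
            count = sum(1 for t in cases if c <= t)
            if count / n >= min_support:
                frequent[c] = count
                level.append(c)
        k += 1
    return frequent

cases = [{"D", "A", "V", "C"}, {"A", "T", "C"}, {"D", "A", "V", "C"},
         {"D", "A", "T", "C"}, {"D", "A", "T", "V", "C"}, {"A", "T", "V"}]
```

With min_support = 50%, this recovers the frequent itemsets tabulated on the following slides (e.g. A in all 6 cases, AC in 5, DAVC in 3), while DT (2 of 6 cases) is pruned.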
Association Rules: Apriori
Items: D = CD player, A = alarm, T = TV, V = VCR, C = computer.
Cases:
1: D A V C
2: A T C
3: D A V C
4: D A T C
5: D A T V C
6: A T V
Frequency of itemsets:
100% (6): A
83% (5): C, AC
67% (4): D, T, V, DA, DC, AT, AV, DAC
50% (3): DV, TC, VC, DAV, DVC, ATC, AVC, DAVC
Association rules with confidence = 100%:
D → A (4/4), D → C (4/4), D → AC (4/4), T → A (4/4), V → A (4/4), C → A (5/5),
DC → A (4/4), DV → A (3/3), VC → A (3/3), DVC → A (3/3), AVC → D (3/3), DAV → C (3/3)
Association rules with confidence >= 80%:
C → D (4/5), A → C (5/6), C → DA (4/5)
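The confidence figures above can be checked directly against the six cases: confidence(X → Y) is the support of X ∪ Y divided by the support of X. A minimal sketch (the `confidence` helper is illustrative):

```python
def confidence(cases, antecedent, consequent):
    """confidence(X -> Y) = support(X and Y) / support(X)."""
    x = set(antecedent)
    xy = x | set(consequent)
    n_x = sum(1 for t in cases if x <= t)    # cases containing X
    n_xy = sum(1 for t in cases if xy <= t)  # cases containing X and Y
    return n_xy / n_x

cases = [{"D", "A", "V", "C"}, {"A", "T", "C"}, {"D", "A", "V", "C"},
         {"D", "A", "T", "C"}, {"D", "A", "T", "V", "C"}, {"A", "T", "V"}]
```

For example, D → A holds in 4 of the 4 cases containing D (confidence 100%), while A → C holds in 5 of the 6 cases containing A (83%).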
Association rules
• Differences with respect to the spatial domain:
– The notion of transaction or case does not exist, since the data are embedded in a continuous space. Partitioning the space may over- or underestimate confidences.
– Itemsets are smaller in the spatial domain, so the cost of generating candidates is not the dominant factor; the enumeration of neighbors dominates the computational cost.
– In most cases, spatial items are discretized versions of continuous variables.
– The notion of transaction is replaced by neighborhood.
Example GeoMiner query:
discover spatial association rules
inside British Columbia
from road R, water W, mines M, boundary B
in relevance to town T
where g_close_to(T.geo, X.geo) and X in {R, W, M, B}
and T.type = “large”
and R.type in {divided_highway}
and W.type in {sea, ocean, large_lake, large_river}
and B.admin_region_1 in “B.C.” and B.admin_region_2 in “U.S.A.”
Discovery of Spatial Association Rules (cont.)
Note: “close_to” is a condition-dependent predicate defined by a set of knowledge rules.
For example, the following rule states that X is close_to Y if X is a town, Y is a country, and their distance is within 80 km:
Rule: close_to(X, Y) ← is_a(X, town) ∧ is_a(Y, country) ∧ dist(X, Y, d) ∧ d ≤ 80 km
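The knowledge rule reads as a simple predicate. A minimal sketch, assuming objects are represented as dicts with a `type` field and that the pairwise distance is supplied in km (both representation choices are assumptions, not part of the original rule syntax):

```python
def close_to(x, y, dist_km):
    """close_to(X, Y) holds if X is a town, Y is a country,
    and their distance is within 80 km."""
    return (x["type"] == "town"
            and y["type"] == "country"
            and dist_km <= 80)
```
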
Discovery of Spatial Association Rules (cont.)
Hierarchy for data relations: concept hierarchy for water (levels 1 to 4):
Water (level 1)
– River (level 2)
– – Large river (level 3)
– – – Fraser river (level 4)
– – Small river (level 3)
– Sea (level 2)
– Lake (level 2)
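A concept hierarchy like the one for water can be encoded as child-to-parent links, so the mining process can generalize a concrete concept upward level by level. A sketch (the dictionary encoding and helper name are assumptions):

```python
# Child -> parent encoding of the water concept hierarchy.
PARENT = {
    "river": "water", "sea": "water", "lake": "water",
    "large_river": "river", "small_river": "river",
    "fraser_river": "large_river",
}

def ancestors(concept):
    """Return the chain of more general concepts, most specific first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain
```
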
Hierarchy of topological relations:
[Figure: g_close_to is the most general predicate, covering close_to and not_disjoint; not_disjoint in turn covers intersects, adjacent_to, equal, inside / covered_by, and contains / covers.]
Step 1: Task_relevant_DB := extract_task_relevant_objects(SDB, RDB);
(1) Towns: only large towns
(2) Roads: only divided highways
(3) Water: only seas, oceans, large lakes and large rivers
(4) Mines: any mines
(5) Boundary: only the boundary of B.C. and U.S.A.
The set of relevant data is retrieved by executing the data retrieval methods of the data mining query, which extract the data sets whose spatial portion is inside B.C., as specified by the clauses:
... g_close_to(T.geo, X.geo) and X in {R, W, M, B} and T.type = “large” and R.type in {divided_highway} and W.type in {sea, ocean, large_lake, large_river} and B.admin_region_1 in “B.C.” and B.admin_region_2 in “U.S.A.” ...
Discovery of Spatial Association Rules (cont.)
Step 2: Coarse_predicate_DB := coarse_spatial_computation(Task_relevant_DB);
Town       | Water               | Road                  | Boundary | Mine
Victoria   | Juan_de_Fuca_Strait | Highway_1, Highway_17 | US       |
Saanich    | Juan_de_Fuca_Strait | Highway_1, Highway_17 | US       |
Prince     |                     | Highway_97            |          |
Pentincton | Okanagan_Lake       | Highway_97            | US       | Allala
…          | …                   | …                     | …        | …
At this level we can already mine:
is_a(X, large_town) → g_close_to(X, water) (80%)
is_a(X, large_town) ∧ g_close_to(X, sea) → g_close_to(X, us_boundary) (92%)
The “generalized close to” (g_close_to) relationship between (large) towns and the other four classes of entities is computed at a relatively coarse resolution level [using MBR data structures, R*-trees and other approximations; more on this later].
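The coarse g_close_to test operates on minimum bounding rectangles rather than exact geometries. A minimal sketch, assuming MBRs are given as (xmin, ymin, xmax, ymax) tuples and a distance threshold in the same units (the function name is illustrative):

```python
def mbr_g_close_to(a, b, threshold):
    """Coarse filter: two MBRs are g_close_to if the gap between
    the rectangles is within the threshold (0 if they overlap)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    dx = max(bx0 - ax1, ax0 - bx1, 0)  # horizontal gap, 0 when overlapping
    dy = max(by0 - ay1, ay0 - by1, 0)  # vertical gap, 0 when overlapping
    return (dx * dx + dy * dy) ** 0.5 <= threshold
```

Objects that pass this cheap filter are kept for the refined (exact-geometry) computation in later steps; objects that fail it can never satisfy the concrete predicates.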
Discovery of Spatial Association Rules (cont.)
Step 3: Large_Coarse_predicate_DB := filtering_with_min_support(Coarse_predicate_DB);
Town       | Water                              | Road                                              | Boundary
Victoria   | <adjacent_to, Juan_de_Fuca_Strait> | <intersects, Highway_1>, <intersects, Highway_17> | <close_to, US>
Saanich    | <adjacent_to, Juan_de_Fuca_Strait> | <intersects, Highway_1>, <close_to, Highway_17>   | <close_to, US>
Prince     |                                    | <intersects, Highway_97>                          |
Pentincton | <adjacent_to, Okanagan_Lake>       | <intersects, Highway_97>                          | <close_to, US>
…          | …                                  | …                                                 | …
Refined computation is performed on the large predicate sets, i.e., those retained in the g_close_to table. Each g_close_to predicate is replaced by one or a set of concrete predicates such as intersects, adjacent_to, close_to, inside, etc.
Discovery of Spatial Association Rules (cont.)
Step 4: Fine_predicate_DB := refined_spatial_computation(Large_Coarse_predicate_DB);
Level 1 (min support = 20):
k | Large k-predicate set                                                | count
1 | <adjacent_to, water>                                                 | 32
1 | <intersects, highway>                                                | 29
1 | <close_to, highway>                                                  | 29
1 | <close_to, us_boundary>                                              | 28
2 | <intersects, highway>, <close_to, us_boundary>                       | 26
2 | <adjacent_to, water>, <intersects, highway>                          | 25
2 | <adjacent_to, water>, <close_to, us_boundary>                        | 23
3 | <adjacent_to, water>, <intersects, highway>, <close_to, us_boundary> | 22
The level-by-level computation of large predicates and the corresponding association rules proceeds as follows: the computation starts at the topmost concept level and computes the large predicates at that level.
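The per-level filtering with minimum support is a simple threshold pass over the counted predicate sets. A sketch, with counts taken from the level-1 table above (the infrequent <adjacent_to, mine> row is a made-up illustration of what gets pruned):

```python
def filter_with_min_support(predicate_sets, min_support):
    """Keep only the k-predicate sets whose count meets the
    minimum support for the current concept level."""
    return {ps: c for ps, c in predicate_sets.items() if c >= min_support}

level1 = {
    frozenset([("adjacent_to", "water")]): 32,
    frozenset([("intersects", "highway")]): 29,
    frozenset([("adjacent_to", "water"), ("intersects", "highway")]): 25,
    frozenset([("adjacent_to", "mine")]): 4,  # hypothetical infrequent set
}
```

With min support 20, the hypothetical mine predicate is dropped and the three frequent sets survive to the rule-mining step.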
Discovery of Spatial Association Rules (cont.)
Step 5: Find_large_predicates_and_mine_rules(Fine_predicate_DB);
Level 2 (min support = 10):
k | Large k-predicate set                                                       | count
1 | <close_to, us_boundary>                                                     | 28
1 | <close_to, provincial_highway>                                              | 24
1 | <adjacent_to, sea>                                                          | 21
1 | <intersects, provincial_highway>                                            | 21
1 | <adjacent_to, large_river>                                                  | 11
2 | <close_to, us_boundary>, <close_to, provincial_highway>                     | 22
2 | <close_to, us_boundary>, <intersects, provincial_highway>                   | 19
2 | <adjacent_to, sea>, <close_to, us_boundary>                                 | 15
2 | <adjacent_to, sea>, <close_to, provincial_highway>                          | 11
3 | <adjacent_to, sea>, <close_to, provincial_highway>, <close_to, us_boundary> | 10
After mining rules at the highest level of the concept hierarchy, large k-predicate sets can be computed in the same way at the lower concept levels, which results in the following tables.
Discovery of Spatial Association Rules (cont.)
Level 3 (min support = 7):
k | Large k-predicate set                                    | count
1 | <close_to, us_boundary>                                  | 28
1 | <adjacent_to, fraser_river>                              | 10
1 | <adjacent_to, georgia_straight>                          | 9
2 | <adjacent_to, georgia_straight>, <close_to, us_boundary> | 7
The mining process stops at the lowest level of the hierarchies, or when an empty large 1-predicate set is derived.
A rule example:
is_a(X, large_town) ∧ adjacent_to(X, sea) → close_to(X, us_boundary) (100%)
Classification and Regression
• Classification:
– constructs a model (classifier) based on the training set and uses it to classify new data
– Example: climate classification, …
• Regression:
– models continuous-valued functions, i.e., predicts unknown or missing values
– Example: stock trend prediction, …
Classification
• Definition: f : D → L
where D is the domain of f, i.e., the domain of attribute values, and L is the set of labels or classes. For example, in a problem of bird habitat, D is a three-dimensional space: longevity of the vegetation, depth of water, and distance to water. L has two possible values: nest and no nest.
• The goal is to find a good f.
Classification (1): Model Construction
Training data:
NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no
Classification algorithms produce the classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification (2): Prediction Using the Model
The classifier is applied to testing data and to unseen data.
For example, the unseen tuple (Jeff, Professor, 4): Tenured?
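Applying the learned model to an unseen tuple is a straight rule evaluation. A sketch of the slide's classifier (the function name is illustrative):

```python
def classify(rank, years):
    """Model from the training set:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"
```

For the unseen tuple (Jeff, Professor, 4), the rank test fires, so the model predicts tenured = yes.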
Classification Techniques
• Decision tree induction
• Bayesian classification
• Neural networks
• Genetic algorithms
• Fuzzy sets and logic
Regression
• Regression is similar to classification:
– First, construct a model
– Second, use the model to predict unknown values
• Methods:
– Linear and multiple regression
– Non-linear regression
• Regression differs from classification:
– Classification predicts categorical class labels
– Regression models continuous-valued functions
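With a single predictor, linear regression reduces to two closed-form coefficients. A minimal ordinary-least-squares sketch (the helper name and sample data are illustrative):

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b
```

Fitting points that lie exactly on y = 1 + 2x recovers intercept 1 and slope 2; on noisy data, the same formulas give the least-squares line.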
Predicting Location Using Similarity Map
• Given:
– S, a set of locations {s1, …, sn} in a geographic space G.
– A collection of explanatory functions xk : S → Rk, where Rk, k = 1..K, is the range of possible values of the k-th explanatory function.
– A dependent class variable c : S → C = {c1, …, cM}.
– A value for the parameter α, the relative importance of spatial accuracy.
• Find: a classification model ĉ : R1 × … × RK → C.
• Objective: maximize
similarity(map(ĉ), map(c)) = (1 − α) · classification_accuracy(ĉ, c) + α · spatial_accuracy(ĉ, c)
Predicting Locations Using Similarity Map
• Constraints:
– The geographic space S is Euclidean.
– The values of the explanatory functions x1, …, xK and of the dependent class variable c may depend on the neighbors’ values (spatial autocorrelation).
– The domain Rk of each explanatory function is a domain of real numbers.
– The domain of the dependent variable is C = {0, 1}.
• Two distinguishing characteristics:
– Spatial autocorrelation
– The objective function combines spatial and classification accuracy.
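The combined objective can be sketched on a 1-D grid of cells labelled 1 (nest) / 0 (no nest). The slides leave the spatial-accuracy measure abstract, so crediting a predicted nest cell when an actual nest lies in the same or an adjacent cell is an assumption of this sketch:

```python
def combined_objective(actual, predicted, alpha):
    """(1 - alpha) * classification accuracy + alpha * spatial accuracy,
    on a 1-D grid of 0/1 cell labels. Spatial accuracy here (an
    assumption): fraction of predicted-1 cells with an actual 1 in the
    same or an adjacent cell."""
    n = len(actual)
    class_acc = sum(a == p for a, p in zip(actual, predicted)) / n
    pred_cells = [i for i, p in enumerate(predicted) if p == 1]
    if pred_cells:
        hits = sum(any(0 <= i + d < n and actual[i + d] == 1
                       for d in (-1, 0, 1))
                   for i in pred_cells)
        spat_acc = hits / len(pred_cells)
    else:
        spat_acc = 0.0
    return (1 - alpha) * class_acc + alpha * spat_acc
```

A prediction that is off by one cell scores poorly on classification accuracy but well on spatial accuracy, which is exactly the trade-off α controls.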
Clustering
• It is the process of finding groups without knowing in advance the number or the labels of the groups.
• Example: the counties in Chile can be clustered based on 4 attributes:
– Unemployment rate
– Population
– Per capita income
– Life expectancy
• Two types of clustering with different objectives are:
– Identify the central cities and their regions of influence by means of the variance of the attribute values within the space.
– Identify areas of the space where an attribute is homogeneous.
Clustering
• Definition 1:
– Given a set S = {s1, …, sn} of spatial objects (e.g., points) and a real-valued non-spatial attribute φ evaluated over S (φ : S → R).
– Find two disjoint subsets of S, C = {c1, …, ck} and NC = S − C = {nc1, …, ncl}, with k < n.
– Goal: min over C ⊆ S of ∑(j=1..l) | φ(ncj) − ∑(i=1..k) φ(ci) / dist(ncj, ci)² |²
– where dist(a, b) is the Euclidean distance or some other distance measure.
– Constraints:
• The influence of a center decreases with the square of the distance.
• There is at most one non-spatial attribute.
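Definition 1's objective can be evaluated directly for a candidate set of centers. A sketch assuming 2-D points, an attribute dict φ, and the squared-distance decay stated in the constraints (names are illustrative):

```python
def center_influence_error(points, attr, centers):
    """Objective of Definition 1: for each non-center point, compare
    its attribute value with the summed influence of the centers,
    where influence decays with the squared distance."""
    def dist2(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    err = 0.0
    for p in points:
        if p in centers:
            continue
        influence = sum(attr[c] / dist2(p, c) for c in centers)
        err += (attr[p] - influence) ** 2
    return err
```

A search over candidate center sets C would minimize this error; the sketch only scores one candidate.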
Clustering
• Definition 2:
– Given a set S = {s1, …, sn} of spatial objects, a set of real-valued non-spatial attributes φi, with i = 1..I, defined over S (φi : S → R), and a neighborhood structure E on S.
– Find K subsets Ck ⊆ S, with k = 1..K, such that:
– Goal: min over {Ck} of ∑(k) ∑(si, sj ∈ Ck) dist(F(si), F(sj)) + ∑(i, j) nbddist(Ci, Cj)
– where F is the cross product of the φi’s, i = 1..I; dist(a, b) is the distance measure; and nbddist(C, D) is the number of pairs of neighbors in E with one point in C and the other in D, i.e., pairs of neighbors mapped to different clusters.
– Constraints: |Ck| > 1 for all k = 1..K
Clustering: categories
• Hierarchical methods: starting with one cluster, successive partitions are made until a criterion is satisfied. These algorithms produce a tree of clusters called a dendrogram.
• Partitional methods: start from an initial assignment of the data to clusters and then reallocate data among the clusters until a criterion is satisfied. These methods tend to find clusters of spherical shape.
• Density-based methods: find clusters based on the density of points in a region.
• Grid-based methods: partition the space into cells and then perform the required operations on the quantized space. Cells that contain many points are considered dense and are connected to create clusters.
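The partitional category can be illustrated with a small k-means sketch: each point is reallocated to its nearest centroid and the centroids are recomputed, here for a fixed number of iterations for simplicity (a real implementation would iterate to convergence):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Partitional clustering sketch: reallocate 2-D points to the
    nearest of k centroids, then recompute the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centroids[i][0]) ** 2
                                  + (p[1] - centroids[i][1]) ** 2)
            clusters[j].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters
```

On two well-separated blobs of three points each, the reallocation settles into the two natural groups, the spherical-cluster bias mentioned above.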
Clustering in SDB
• The idea is to make use of the indexing: if the SDB is large, not all the points will fit in main memory.
• For example, for an algorithm that requires n initial points to represent n clusters, a natural idea is to incorporate the notion of containment into the index definition in order to find the closest objects.
• A method that finds centroids of subdivisions of the space is the Voronoi diagram.