Mining Regional Knowledge in Spatial Dataset

UH Data Mining & Machine Learning Group CS@UH May 1, 2009 Christoph F. Eick Department of Computer Science University of Houston A Domain-Driven Framework for Clustering with Plug-in Fitness Functions and its Application to Spatial Data Mining CACS Lafayette (LA) , May 1, 2009


Page 1: Mining Regional Knowledge in Spatial Dataset

A Domain-Driven Framework for Clustering with Plug-in Fitness Functions and its Application to Spatial Data Mining

Christoph F. Eick, Department of Computer Science, University of Houston

UH Data Mining & Machine Learning Group (CS@UH)

CACS Lafayette (LA), May 1, 2009

Page 2: Mining Regional Knowledge in Spatial Dataset

Talk Outline

1. Domain-driven Data Mining (D3M, DDDM)
2. A Framework for Clustering with Plug-in Fitness Functions
3. MOSAIC: a Clustering Algorithm that Supports Plug-in Fitness Functions
4. Popular Fitness Functions
5. Case Studies: Applications to Spatial Data Mining
   a. Co-location Mining
   b. Multi-objective Clustering
   c. Change Analysis in Spatial Data
6. Summary and Conclusion

Page 3: Mining Regional Knowledge in Spatial Dataset

Other Contributors to the Work Presented Today

Current PhD Students: Oner-Ulvi Celepcikay, Chun-Shen Chen, Rachsuda Jiamthapthaksin, Vadeerat Rinsurongkawong
Former PhD Student: Wei Ding (Assistant Professor, UMASS, Boston)
Former Master Students: Rachana Parmar, Dan Jiang, Seungchan Lee
Domain Experts: Jean-Philippe Nicot (Bureau of Economic Geology, UT Austin), Tomasz F. Stepinski (Lunar and Planetary Institute, Houston), Michael Twa (College of Optometry, University of Houston)

Page 4: Mining Regional Knowledge in Spatial Dataset

DDDM: what is it about?

• Differences concerning the objectives of data mining created a gap between academia and applications of data mining in business and science.
• Traditional data mining targets the production of generic, domain-independent algorithms and tools; as a result, data mining algorithms have little capability to adapt to external, domain-specific constraints and evaluation measures.
• To overcome this mismatch, current research has recognized the need to incorporate domain intelligence into data mining algorithms. Domain intelligence requires:
  • the involvement of domain knowledge and experts,
  • the consideration of domain constraints and domain-specific evaluation measures,
  • the discovery of in-depth patterns based on a deep domain model.
• On top of the data-driven framework, DDDM aims to develop novel methodologies and techniques for integrating domain knowledge as well as actionability measures into the KDD process, and to actively involve humans.

Page 5: Mining Regional Knowledge in Spatial Dataset


The Vision of DDDM

“DDDM…can assist in a paradigm shift from “data-driven hidden pattern mining” to “domain-driven actionable knowledge discovery”, and provides support for KDD to be translated to the real business situations as widely expected.” [CZ07]

Page 6: Mining Regional Knowledge in Spatial Dataset


IEEE TKDE Special Issue

Page 7: Mining Regional Knowledge in Spatial Dataset

2. Clustering with Plug-in Fitness Functions

Motivation: Finding subgroups in geo-referenced datasets has many applications. However, in many applications the subgroups to be searched for do not share the characteristics considered by traditional clustering algorithms, such as cluster compactness and separation.

Domain knowledge frequently imposes additional requirements concerning what constitutes a “good” subgroup.

Consequently, it is desirable to develop clustering algorithms that provide plug-in fitness functions that allow domain experts to express desirable characteristics of subgroups they are looking for.

Only very few clustering algorithms published in the literature provide plug-in fitness functions; consequently existing clustering paradigms have to be modified and extended by our research to provide such capabilities.

Page 8: Mining Regional Knowledge in Spatial Dataset

Ch. Eick et al.: MOSAIC…, DaWaK, Regensburg 2007

Clustering with Plug-In Fitness Functions

[Figure: taxonomy of clustering algorithms by fitness-function support. Categories: no fitness function, fixed fitness function, implicit fitness function, plug-in fitness function. Algorithms shown: DBSCAN, Hierarchical Clustering, K-Means, PAM, CHAMELEON, MOSAIC.]

Page 9: Mining Regional Knowledge in Spatial Dataset

Current Suite of Spatial Clustering Algorithms

• Representative-based: SCEC[1], SPAM[3], CLEVER[4]
• Grid-based: SCMRG[1]
• Agglomerative: MOSAIC[2]
• Density-based: SCDE[4], DCONTOUR[8] (not really plug-in, but some fitness functions can be simulated)

Remark: All algorithms partition a dataset into clusters by maximizing a reward-based, plug-in fitness function.

Page 10: Mining Regional Knowledge in Spatial Dataset

Spatial Clustering Algorithms

• Datasets are assumed to have the following structure: (<spatial attributes>;<non-spatial attributes>), e.g. (longitude, latitude; <chemical concentrations>+)
• Clusters are found in the subspace of the spatial attributes; they are called regions in the following.
• The non-spatial attributes are used by the fitness function, but neither in distance computations nor by the clustering algorithm itself.
• Clustering algorithms are assumed to maximize reward-based fitness functions that have the following structure:

$$q(X) = \sum_{c \in X} \mathit{reward}(c) = \sum_{c \in X} i(c) \cdot |c|^{\beta}$$

where β is a parameter that determines the premium put on cluster size (larger values of β lead to fewer, larger clusters).
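The structure of the fitness function, q(X) as the sum over all clusters c of i(c) times |c| raised to the power β, can be illustrated with a minimal Python sketch. The function names, the tuple layout `(longitude, latitude, value)`, and the choice of variance as the interestingness function i are assumptions made for illustration only; they are not taken from the group's implementation.

```python
from statistics import pvariance


def reward(cluster, interestingness, beta):
    """Reward of one cluster: interestingness times size to the power beta."""
    return interestingness(cluster) * len(cluster) ** beta


def fitness(clustering, interestingness, beta):
    """q(X): sum of the rewards of all clusters in the clustering X."""
    return sum(reward(c, interestingness, beta) for c in clustering)


def variance_of_value(cluster):
    """Hypothetical plug-in interestingness: variance of the non-spatial value.

    Only the third field (the non-spatial attribute) is used; the spatial
    attributes play no role in the fitness computation.
    """
    return pvariance([value for _lon, _lat, value in cluster])


# Two small regions; objects are (longitude, latitude, value) tuples.
c1 = [(0.0, 0.0, 1.0), (0.0, 1.0, 3.0)]                     # variance 1.0
c2 = [(5.0, 5.0, 2.0), (5.0, 6.0, 2.0), (6.0, 5.0, 2.0)]    # variance 0.0

q_beta_1 = fitness([c1, c2], variance_of_value, beta=1.0)
q_beta_2 = fitness([c1, c2], variance_of_value, beta=2.0)
```

Swapping `variance_of_value` for another function changes what counts as an interesting region without touching the clustering code, which is the point of the plug-in design.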

Page 11: Mining Regional Knowledge in Spatial Dataset

3. MOSAIC: a Clustering Algorithm that Supports Plug-in Fitness Functions

[Fig. 6: An illustration of MOSAIC’s approach. Panels: (a) input, (b) output]

MOSAIC[2] supports plug-in fitness functions and provides a generic framework that integrates representative-based clustering, agglomerative clustering, and proximity graphs. It approximates arbitrarily shaped clusters using unions of small convex polygons.

Page 12: Mining Regional Knowledge in Spatial Dataset

3.1 Representative-based Clustering

[Figure: four clusters (1 to 4) in a two-dimensional space with axes Attribute1 and Attribute2]

Objective: Find a set of objects OR such that the clustering X obtained by using the objects in OR as representatives minimizes q(X).

Properties:
• Uses 1NN queries to assign objects to a cluster
• Cluster shapes are limited to convex polygons

Popular Algorithms: K-means, K-medoids, CLEVER, SPAM
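The 1NN assignment step can be sketched as follows. This is a minimal illustration that assumes Euclidean distance and objects given as coordinate tuples; the function name is hypothetical, and the search for good representatives (the part K-means, K-medoids, CLEVER, or SPAM actually perform) is not shown.

```python
import math


def assign_to_representatives(objects, representatives):
    """Assign every object to the cluster of its nearest representative (1NN)."""
    clusters = {i: [] for i in range(len(representatives))}
    for obj in objects:
        nearest = min(
            range(len(representatives)),
            key=lambda i: math.dist(obj, representatives[i]),
        )
        clusters[nearest].append(obj)
    return clusters


objects = [(0, 0), (1, 0), (9, 9), (10, 10)]
representatives = [(0, 0), (10, 10)]
clusters = assign_to_representatives(objects, representatives)
```

Because each object goes to its nearest representative, the resulting clusters are the cells of the Voronoi diagram of the representatives, which is why cluster shapes are limited to convex polygons.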

Page 13: Mining Regional Knowledge in Spatial Dataset

3.2 MOSAIC and Agglomerative Clustering

Traditional Agglomerative Clustering Algorithms
• The decision which clusters to merge next is made solely based on distances between clusters.
• In particular, the two clusters that are closest to each other with respect to a distance measure (single link, group average, …) are merged.
• Use of some distance measures might lead to non-contiguous clusters.

Example: If group average is used, clusters C3 and C4 would be merged next

Page 14: Mining Regional Knowledge in Spatial Dataset

MOSAIC and Agglomerative Clustering

Advantages of MOSAIC over traditional agglomerative clustering:
• Plug-in fitness function
• Conducts a wider search: considers all neighboring clusters and merges the pair of clusters that enhances fitness the most
• Clusters are always contiguous
• Expensive algorithm is only run for 20-1000 iterations
• Highly generic algorithm

Page 15: Mining Regional Knowledge in Spatial Dataset

3.3 Proximity Graphs

• How to identify neighbouring clusters for representative-based clustering algorithms?
• Proximity graphs provide various definitions of “neighbour”:

NNG ⊆ MST ⊆ RNG ⊆ GG ⊆ DT

NNG = Nearest Neighbour Graph
MST = Minimum Spanning Tree
RNG = Relative Neighbourhood Graph
GG = Gabriel Graph
DT = Delaunay Triangulation (neighbours of a 1NN-classifier)

Page 16: Mining Regional Knowledge in Spatial Dataset

Proximity Graphs: Delaunay

• The Delaunay Triangulation is the dual of the Voronoi diagram
• Three points are each other's neighbours if their tangent sphere contains no other points
• Complete: captures all neighbouring clusters
• Time-consuming to compute; infeasible in practice in high dimensions

Page 17: Mining Regional Knowledge in Spatial Dataset

Proximity Graphs: Gabriel

• The Gabriel graph is a subgraph of the Delaunay Triangulation (some decision boundaries might be missed)
• Points are neighbours only if their (diametral) sphere of influence is empty
• Can be computed more efficiently: O(k^3)
• Approximate algorithms with lower complexity exist
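The O(k^3) bound corresponds to the brute-force construction: for each of the O(k^2) pairs of points, check all remaining points against the diametral sphere. The sketch below is a minimal illustration of that construction, not the implementation used in MOSAIC; the function names are hypothetical, and points lying exactly on the sphere boundary are treated as not blocking an edge, which is a convention chosen here.

```python
from itertools import combinations


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def gabriel_graph(points):
    """Return the Gabriel graph edges as a set of index pairs (i, j), i < j.

    p and q are neighbours if no other point r lies strictly inside the
    sphere whose diameter is the segment pq. By Thales' theorem, r is
    inside that sphere exactly when d(p,r)^2 + d(q,r)^2 < d(p,q)^2.
    """
    edges = set()
    for i, j in combinations(range(len(points)), 2):
        p, q = points[i], points[j]
        d_pq = squared_distance(p, q)
        if all(
            squared_distance(p, r) + squared_distance(q, r) >= d_pq
            for k, r in enumerate(points)
            if k != i and k != j
        ):
            edges.add((i, j))
    return edges


# The third point sits between the first two and blocks the edge (0, 1).
edges = gabriel_graph([(0.0, 0.0), (2.0, 0.0), (1.0, 0.1)])
```

In MOSAIC the points would be the cluster representatives, so k is the number of initial clusters rather than the number of objects.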

Page 18: Mining Regional Knowledge in Spatial Dataset


MOSAIC’s Input

Fig. 10: Gabriel graph for clusters generated by a representative-based clustering algorithm

Page 19: Mining Regional Knowledge in Spatial Dataset

3.4 Pseudo Code MOSAIC

1. Run a representative-based clustering algorithm to create a large number of clusters.
2. Read the representatives of the obtained clusters.
3. Create a merge-candidate relation using proximity graphs.
4. WHILE there are merge-candidates (Ci, Cj) left
   BEGIN
     Merge the pair of merge-candidates (Ci, Cj) that enhances fitness function q the most into a new cluster C’
     Update merge-candidates: for all C: Merge-Candidate(C’, C) ⇔ Merge-Candidate(Ci, C) ∨ Merge-Candidate(Cj, C)
   END
   RETURN the best clustering X found.
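Step 4 of the pseudo code, the greedy merge loop, can be sketched in Python as shown below. This is a simplified illustration under stated assumptions: clusters are frozensets keyed by integer ids, merge candidates are id pairs already derived from a proximity graph, and fitness is recomputed from scratch for every candidate. The function name and the toy fitness function `q` are hypothetical.

```python
def mosaic_merge(clusters, merge_candidates, q):
    """Greedy merging: repeatedly merge the candidate pair that yields the
    highest fitness, and return the best clustering seen along the way."""
    clusters = dict(clusters)
    candidates = {tuple(sorted(pair)) for pair in merge_candidates}
    best_x = list(clusters.values())
    best_q = q(best_x)
    next_id = max(clusters) + 1

    while candidates:
        def fitness_after(pair):
            i, j = pair
            rest = [c for k, c in clusters.items() if k != i and k != j]
            return q(rest + [clusters[i] | clusters[j]])

        # Sorting makes tie-breaking deterministic.
        i, j = max(sorted(candidates), key=fitness_after)
        clusters[next_id] = clusters.pop(i) | clusters.pop(j)

        # The new cluster inherits the neighbours of both merged clusters.
        updated = set()
        for a, b in candidates:
            if (a, b) == (i, j):
                continue
            a = next_id if a in (i, j) else a
            b = next_id if b in (i, j) else b
            if a != b:
                updated.add(tuple(sorted((a, b))))
        candidates = updated
        next_id += 1

        current = list(clusters.values())
        current_q = q(current)
        if current_q > best_q:
            best_x, best_q = current, current_q

    return best_x, best_q


def q(clustering):
    """Toy fitness: reward |c|^1.5 for clusters spanning at most 5 units."""
    return sum(len(c) ** 1.5 for c in clustering if max(c) - min(c) <= 5)


# Objects are 1-D positions; clusters 0-1 and 1-2 are neighbours.
initial = {0: frozenset({0, 1}), 1: frozenset({2, 3}), 2: frozenset({10, 11})}
best_x, best_q = mosaic_merge(initial, [(0, 1), (1, 2)], q)
```

Note that merging continues even when fitness drops, as in the pseudo code; the final merge here produces a single cluster with zero reward, but the intermediate two-cluster solution is what gets returned.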

Page 20: Mining Regional Knowledge in Spatial Dataset

Complexity MOSAIC

Let
n be the number of objects in the dataset
k be the number of clusters generated by the representative-based algorithm

Complexity MOSAIC: O(k^3 + k^2 * O(q(X)))

Remarks:
• The above formula assumes that fitness is computed from scratch when a new clustering is obtained
• Lower complexities can be obtained by incrementally reusing results of previous fitness computations
• Our current implementation assumes that only additive fitness functions are used
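For additive fitness functions, the incremental reuse mentioned in the remarks follows directly from the sum structure of q. The derivation below is a sketch under the assumption that q(X) is the sum of per-cluster rewards; the notation X' for the clustering after a merge is introduced here for illustration.

```latex
% Merging c_i and c_j replaces two summands by one:
X' = \bigl(X \setminus \{c_i, c_j\}\bigr) \cup \{c_i \cup c_j\}
\quad\Longrightarrow\quad
q(X') = q(X) - \mathit{reward}(c_i) - \mathit{reward}(c_j) + \mathit{reward}(c_i \cup c_j)
```

Only the reward of the merged cluster has to be computed anew; the rewards of all untouched clusters carry over unchanged.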

Page 21: Mining Regional Knowledge in Spatial Dataset

4. Interestingness Measure for Spatial Clustering with Plug-in Fitness Functions

Clustering algorithms maximize fitness functions that must have the following structure:

$$q(X) = \sum_{c \in X} \mathit{reward}(c) = \sum_{c \in X} i(c) \cdot |c|^{\beta}$$

Various interestingness functions i have been introduced in our preliminary work:
• For supervised clustering [1]
• Maximizing the variance of a continuous variable [5]
• For regional association rule scoping [9]
• For co-location patterns involving continuous variables [4]
• …

Some examples of fitness functions will be presented in the case studies.

Page 22: Mining Regional Knowledge in Spatial Dataset

5. Case Studies

1. Co-location patterns involving arsenic pollution
2. Multi-objective clustering
3. Change analysis involving earthquake patterns

Page 23: Mining Regional Knowledge in Spatial Dataset


5.1 Co-location Patterns Involving Arsenic Pollution

Page 24: Mining Regional Knowledge in Spatial Dataset

Regional Co-location Mining

Goal: To discover regional co-location patterns involving continuous variables in which the continuous variables take values from the wings of their statistical distribution.

Dataset: (longitude, latitude, <concentrations>+)

Page 25: Mining Regional Knowledge in Spatial Dataset

Summary Co-location Approach

• Pattern interestingness in a region is evaluated using products of (cut-off) z-scores. In general, products of z-scores measure correlation.
• Additionally, purity is considered; its influence is controlled by a parameter θ.
• Finally, the parameter β determines how much premium is put on the size of a region when computing region rewards.
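A minimal sketch of how products of cut-off z-scores and purity might be combined is given below. The exact formula is not stated on the slide, so several details are assumptions: the cut-off keeps only z-scores in the direction named by the pattern, the region score is the average product over all objects, purity is the fraction of objects with a non-zero product, and purity enters as a factor raised to the power θ. All function and attribute names are hypothetical.

```python
def cutoff_z(z, direction):
    """Keep only the part of a z-score that points in the given direction."""
    return max(z, 0.0) if direction == "up" else max(-z, 0.0)


def pattern_interestingness(region, pattern, theta=1.0):
    """Average product of cut-off z-scores, weighted by purity ** theta.

    region:  list of dicts mapping attribute name -> z-score
    pattern: dict mapping attribute name -> "up" or "down"
    """
    products = []
    for obj in region:
        product = 1.0
        for attribute, direction in pattern.items():
            product *= cutoff_z(obj[attribute], direction)
        products.append(product)
    average_product = sum(products) / len(products)
    purity = sum(1 for p in products if p > 0) / len(products)
    return average_product * purity ** theta


region = [{"As": 2.0, "Mo": 1.0}, {"As": -1.0, "Mo": 3.0}]
pattern = {"As": "up", "Mo": "up"}
score_theta_1 = pattern_interestingness(region, pattern, theta=1.0)
score_theta_2 = pattern_interestingness(region, pattern, theta=2.0)
```

In this sketch a larger θ penalizes impure regions more strongly, which matches the role the slide assigns to the purity parameter.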

Page 26: Mining Regional Knowledge in Spatial Dataset

Domain-Driven Clustering for Co-location Mining

1. Define problem
2. Create/Select a fitness function
3. Select a clustering algorithm
4. Select parameters of the clustering algorithm, parameters of the fitness function, and constraints with respect to which patterns are considered
5. Run the clustering algorithm to discover interesting regions and their associated patterns
6. Analyze the results

[Figure: workflow diagram of the six steps, involving a domain expert (Hydrologist)]

Page 27: Mining Regional Knowledge in Spatial Dataset

Example: 2 Sets of Results Using Medium/High Rewards for Purity

Table 5. Top 5 regions ranked by reward (as per formula 8).

Exp. No.  Rank  Region Size  Region Reward  Maximum Valued Pattern  Purity  Avg. Product (max. valued pattern)
Exp. 2    1     181          61684.5323     AsMoVF-                 0.49    52.1019
Exp. 2    2     80           24040.6315     AsBCl-TDS               0.48    70.7322
Exp. 2    3     467          1884.8856      AsTDS                   0.91    0.2047
Exp. 2    4     23           701.7072       AsCl-SO42-TDS           0.78    8.1287
Exp. 2    5     189          587.9790       AsF-                    0.78    0.2909
Exp. 4    1     7            11669.7965     AsBCl-TDS               1.0     630.1097
Exp. 4    2     117          10407.3250     AsVF-                   0.91    12.8550
Exp. 4    3     4            2203.2526      AsV SO42-TDS            1.0     275.4066
Exp. 4    4     2            1531.4887      AsMoVB                  1.0     541.4630
Exp. 4    5     530          1426.9140      AsTDS                   0.90    0.1939

All: (AsB or AsB) and |B|<5

Experiment 2: β = 1.5, θ = 1.0

Experiment 4: β = 1.5, θ = 5.0

Page 28: Mining Regional Knowledge in Spatial Dataset

Challenges in Regional Co-location Mining (RCLM)

• Kind of a “seeking a needle in a haystack” problem, because we search for both interesting places and interesting patterns.
• Our current interestingness measure is not anti-monotone: a superset of a co-location set might be more interesting.
• Observation: different fitness function parameter settings lead to quite different results, many of which are valuable to domain experts; therefore, it is desirable to combine the results of many runs.
• “Clustering of the future”: run clustering algorithms multiple times with multiple fitness functions, and summarize the results (multi-run/multi-objective clustering).

Page 29: Mining Regional Knowledge in Spatial Dataset

5.2 Multi-Run Clustering

• Find clusters that are good with respect to multiple objectives in an automated fashion.
• Each objective is captured in a reward-based fitness function.
• To achieve this goal, we run clustering algorithms multiple times with respect to compound fitness functions that capture multiple objectives, and store non-dominated clusters in a cluster repository.
• Summarization tools are provided that create final clusterings with respect to a user’s perspective.
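The idea of keeping only non-dominated clusters in a repository can be sketched with plain Pareto dominance over per-objective scores. This is a simplification and an assumption: the slide does not define dominance, and the actual framework may use additional criteria. The function names and the `(name, scores)` representation are hypothetical.

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every objective and
    strictly better on at least one (higher scores are better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def update_repository(repository, candidate):
    """Add a (name, scores) candidate unless it is dominated or duplicated;
    drop stored clusters that the candidate dominates."""
    _name, scores = candidate
    if any(dominates(s, scores) or s == scores for _n, s in repository):
        return repository
    kept = [(n, s) for n, s in repository if not dominates(scores, s)]
    return kept + [candidate]


repository = []
for cluster in [
    ("c1", (0.5, 0.2)),
    ("c2", (0.4, 0.1)),   # dominated by c1, never stored
    ("c3", (0.1, 0.9)),   # incomparable with c1, stored
    ("c4", (0.6, 0.3)),   # dominates c1, replaces it
]:
    repository = update_repository(repository, cluster)

stored_names = [name for name, _scores in repository]
```

Each run of the clustering algorithm with a different compound fitness function would feed its clusters through `update_repository`, so the repository accumulates the trade-off front across runs.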

Page 30: Mining Regional Knowledge in Spatial Dataset

An Architecture for Multi-objective Clustering

[Figure: architecture diagram with the components Goal-driven Fitness Function Generator, Clustering Algorithm, Storage Unit, and Cluster Summarization Unit, operating on a spatial dataset; data flows are labeled Q’, X, M, and M’, and steps are labeled S1 to S4]

Given: a set of objectives Q that need to be satisfied; moreover, Q’ ⊆ Q.

Steps in multi-run clustering:
S1: Generate a compound fitness function.
S2: Run a clustering algorithm.
S3: Update the cluster list M.
S4: Summarize the clusters discovered (M’).

Page 31: Mining Regional Knowledge in Spatial Dataset

Example: Multi-Objective RCLM

Example: Finding co-location patterns with respect to arsenic and a single other chemical is a single objective; we are interested in finding co-location regions that satisfy several of those objectives, that is, regions where high arsenic concentrations are co-located with high concentrations of many other chemicals.

[Figure a: the top 5 regions ordered by rewards using user-defined query {As,Mo}. Region labels: AsMoVBF-Cl-SO42-TDS (Rank 1), AsMoVBF-Cl-SO42-TDS (Rank 2), AsMoVF-Cl-SO42-TDS (Rank 3), AsMoCl-SO42-TDS (Rank 4), AsMoBCl-SO42-TDS (Rank 5)]

Page 32: Mining Regional Knowledge in Spatial Dataset

5.3 Change Analysis in Spatial Data

Question: How do interesting regions where deep earthquakes are in close proximity to shallow earthquakes change?

Cluster Interestingness Measure: Variance of Earthquake Depth

[Figure: Red: clusters in O_old; Blue: clusters in O_new]

Page 33: Mining Regional Knowledge in Spatial Dataset

Novelty Regions in O_new

Novelty Change Predicate:

$$\mathit{Novelty}(r) \Leftrightarrow \left| r \setminus (r'_1 \cup \dots \cup r'_k) \right| > 0, \quad \text{with } r \in X_{new};\ X_{old} = \{r'_1, \dots, r'_k\}$$
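The novelty predicate says that a region r of the new clustering is novel if some part of it is not covered by any region of the old clustering. The sketch below illustrates this with regions represented as sets of grid-cell ids, which is an assumption made to keep the example self-contained (regions could equally be polygons, with set difference replaced by polygon difference). The function name is hypothetical.

```python
def novelty_regions(x_new, x_old):
    """Return, for each new region with novelty, the part of it that is not
    covered by the union of all old regions."""
    covered = set().union(*x_old)
    return [r - covered for r in x_new if r - covered]


x_old = [{1, 2}, {3}]
x_new = [{2, 3}, {3, 4, 5}]
novel = novelty_regions(x_new, x_old)
```

Here the first new region is fully covered by the old regions and yields no novelty, while the second contributes the uncovered cells 4 and 5.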

Page 34: Mining Regional Knowledge in Spatial Dataset

Domain-Driven Change Analysis in Spatial Data

1. Determine two datasets O_old and O_new for which change patterns have to be extracted
2. Cluster both datasets with respect to an interestingness perspective to obtain clusters for each dataset
3. Determine relevant change predicates and select thresholds of change predicates
4. Instantiate change predicates based on the results of step 3
5. Summarize emergent patterns
6. Analyze emergent patterns

[Figure: workflow diagram of the six steps, involving a domain expert (Geologist)]

Page 35: Mining Regional Knowledge in Spatial Dataset

6. Conclusion

• A generic, domain-driven clustering framework has been introduced.
• It incorporates domain intelligence into domain-specific plug-in fitness functions that are maximized by clustering algorithms.
• Clustering algorithms are independent of the fitness function employed. Several clustering algorithms, including prototype-based, agglomerative, and grid-based clustering algorithms, have been designed and implemented in our past research.
• We conducted several case studies in our past research that illustrate the capability of the proposed domain-driven spatial clustering framework to solve challenging problems in planetary sciences, geology, environmental sciences, and optometry.

Page 36: Mining Regional Knowledge in Spatial Dataset

UH-DMML References

1. C. F. Eick, B. Vaezian, D. Jiang, and J. Wang, Discovery of Interesting Regions in Spatial Datasets Using Supervised Clustering, in Proc. 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), Berlin, Germany, September 2006.

2. C. Choo, R. Jiamthapthaksin, C.-S. Chen, O. Celepcikay, C. Giusti, and C. F. Eick, MOSAIC: A Proximity Graph Approach to Agglomerative Clustering, in Proc. 9th International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Regensburg, Germany, September 2007.

3. W. Ding, R. Jiamthapthaksin, R. Parmar, D. Jiang, T. Stepinski, and C. F. Eick, Towards Region Discovery in Spatial Datasets, in Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Osaka, Japan, May 2008.

4. C. F. Eick, R. Parmar, W. Ding, T. Stepinski, and J.-P. Nicot, Finding Regional Co-location Patterns for Sets of Continuous Variables in Spatial Datasets, in Proc. 16th ACM SIGSPATIAL International Conference on Advances in GIS (ACM-GIS), Irvine, California, November 2008.

5. C.-S. Chen, V. Rinsurongkawong, C.F. Eick, and M.D. Twa, Change Analysis in Spatial Data by Combining Contouring Algorithms with Supervised Density Functions, in Proc. Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Bangkok, Thailand, April 2009.

6. A. Bagherjeiran, O. U. Celepcikay, R. Jiamthapthaksin, C.-S. Chen, V. Rinsurongkawong, S. Lee, J. Thomas, and C. F. Eick, Cougar**2: An Open Source Machine Learning and Data Mining Development Framework, in Proc. Open Source Data Mining Workshop (OSDM), Bangkok, Thailand, April 2009.

7. C. F. Eick, O. U. Celepcikay, and R. Jiamthapthaksin, A Unifying Domain-driven Framework for Clustering with Plug-in Fitness Functions and Region Discovery, submitted to IEEE TKDE.

8. R. Jiamthapthaksin, C. F. Eick, and R. Vilalta, A Framework for Multi-Objective Clustering and its Application to Co-Location Mining, submitted to Fifth International Conference on Advanced Data Mining and Applications (ADMA), Beijing, China, August 2009.

9. W. Ding, C. F. Eick, X. Yuan, J. Wang, and J.-P. Nicot, A Framework for Regional Association Rule Mining and Scoping in Spatial Datasets, under review for publication in Geoinformatica.

Page 37: Mining Regional Knowledge in Spatial Dataset

Other References

1. [CZ07] L. Cao and C. Zhang, “The Evolution of KDD: Towards Domain-Driven Data Mining,” Journal of Pattern Recognition and Artificial Intelligence, vol. 21, no. 4, pp. 677-692, World Scientific Publishing Company, 2007.

2. O. Thonnard and M. Dacier, Actionable Knowledge Discovery for Threats Intelligence Support using a Multi-Dimensional Data Mining Methodology, DDDM08.

Page 38: Mining Regional Knowledge in Spatial Dataset


Region Discovery Framework

Objective: Develop and implement an integrated framework to automatically discover interesting regional patterns in spatial datasets. Treats region discovery as a clustering problem.

Page 39: Mining Regional Knowledge in Spatial Dataset

Region Discovery Framework Continued

The clustering algorithms we currently investigate solve the following problem:

Given:
• A dataset O with a schema R
• A distance function d defined on instances of R
• A fitness function q(X) that evaluates a clustering X={c1,…,ck} as follows:

$$q(X) = \sum_{c \in X} \mathit{reward}(c) = \sum_{c \in X} \mathit{interestingness}(c) \cdot \mathit{size}(c)^{\beta}, \quad \beta > 1$$

Objective: Find c1,…,ck ⊆ O such that:
1. ci ∩ cj = ∅ if i ≠ j
2. X={c1,…,ck} maximizes q(X)
3. All clusters ci ∈ X are contiguous in the spatial subspace
4. c1 ∪ … ∪ ck ⊆ O
5. c1,…,ck are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
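Constraints 1 and 4 of the problem statement (clusters are pairwise disjoint and jointly form a subset of O, so some objects may remain unassigned) can be checked mechanically. The sketch below is an illustration with objects represented by ids; the function name is hypothetical, and the contiguity constraint 3 is not checked because it depends on the spatial neighbourhood definition in use.

```python
def is_valid_clustering(clusters, dataset):
    """Check that clusters are pairwise disjoint subsets of the dataset.

    Objects that belong to no cluster are allowed (outliers may stay
    unassigned). Spatial contiguity is not verified here.
    """
    seen = set()
    for cluster in clusters:
        if not cluster <= dataset:   # constraint 4: only objects from O
            return False
        if seen & cluster:           # constraint 1: no overlap
            return False
        seen |= cluster
    return True


dataset = set(range(6))
valid = is_valid_clustering([{0, 1}, {2, 3}], dataset)
overlapping = is_valid_clustering([{0, 1}, {1, 2}], dataset)
foreign = is_valid_clustering([{0, 7}], dataset)
```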

Page 40: Mining Regional Knowledge in Spatial Dataset


[CZ07]

Page 41: Mining Regional Knowledge in Spatial Dataset

Arsenic Water Pollution Problem

• Arsenic pollution is a serious problem in the Texas water supply.
• It is hard to explain what causes arsenic pollution to occur.
• Several datasets were created using the Ground Water Database (GWDB) of the Texas Water Development Board (TWDB), which tests water wells regularly; one of these datasets was used in the experimental evaluation in the paper:
  • All wells have non-null samples for arsenic
  • Multiple sample values are aggregated using avg/max functions
  • Other chemicals may have null values

Format: (Longitude, Latitude, <z-values of chemical concentrations>)