Lecture 9 Spatial Data Mining - Fudan University, admis.fudan.edu.cn/member/sgzhou/courses/data...

Data Mining: Tech. & Appl.

Lecture 9: Spatial Data Mining

Zhou Shuigeng

May 27, 2007


Outline
Spatial Databases
Spatial Data Mining
Spatial Data Warehousing
Spatial Data Mining Methods
Summary
References


Spatial Data
Spatial data has location or geo-referenced features
Some of these features are:
    Address, latitude/longitude (explicit)
    Location-based partitions in databases (implicit)


Spatial Databases
Spatial Database Systems (SDBS)
    database systems supporting spatial data types in data model and implementation
    objects with location and extension in a multi-dimensional space


Spatial Data Format
Raster Data
    represents spatial data as rows / columns of pixels (volume representation)
    obtained from equipment such as earth observation satellites, which measure the emitted / reflected amplitude in some frequency band
Vector Data
    represents spatial data by their boundary (boundary representation)
    points, lines, polygons, polyhedrons, etc.
    often obtained from raster data using image processing methods


Spatial Queries (1)
Spatial selection may involve specialized selection comparison operations:
    Near
    North, South, East, West
    Contained in
    Overlap/intersect
Region (Range) query: find objects that intersect a given region
Nearest neighbor query: find the object closest to an identified object
Distance scan: find objects within a certain distance of an identified object, where the distance is made increasingly larger
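The three query types above can be sketched with a brute-force scan over point objects. This is only an illustration under simplifying assumptions: objects are points rather than extended shapes, the names `objects`, `range_query`, `nearest_neighbor` and `distance_scan` are hypothetical, and a real spatial database would use an index instead of scanning every object.

```python
import math

# Hypothetical point objects: name -> (x, y). Not taken from the lecture.
objects = {"a": (1.0, 1.0), "b": (4.0, 5.0), "c": (2.0, 2.5), "d": (9.0, 9.0)}

def range_query(objs, xmin, ymin, xmax, ymax):
    """Region (range) query: names of points that fall inside the rectangle."""
    return sorted(name for name, (x, y) in objs.items()
                  if xmin <= x <= xmax and ymin <= y <= ymax)

def nearest_neighbor(objs, name):
    """Nearest neighbor query: the other object closest to `name`."""
    q = objs[name]
    return min((n for n in objs if n != name),
               key=lambda n: math.dist(q, objs[n]))

def distance_scan(objs, name, radii):
    """Distance scan: objects within each radius, for increasingly larger radii."""
    q = objs[name]
    return {r: sorted(n for n in objs
                      if n != name and math.dist(q, objs[n]) <= r)
            for r in sorted(radii)}

in_region = range_query(objects, 0, 0, 3, 3)
closest_to_a = nearest_neighbor(objects, "a")
rings = distance_scan(objects, "a", [2, 6])
```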


Spatial Queries (2)


Spatial Queries (3)


Spatial Data Structures
Data structures designed specifically to store or index spatial data
Often based on B-tree or Binary Search Tree
Cluster data on disk based on geographic location
May represent complex spatial structure by placing the spatial object in a containing structure of a specific geographic shape
Techniques:
    Quad Tree
    R-Tree
    k-D Tree


MBR
Minimum Bounding Rectangle
Smallest rectangle that completely contains the object
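A minimal sketch of computing an MBR and testing two MBRs for overlap. It assumes the usual axis-aligned convention (rectangle sides parallel to the coordinate axes) and represents an object by its boundary vertices; the function names and the `(xmin, ymin, xmax, ymax)` tuple layout are illustrative choices, not something the slides specify.

```python
def mbr(points):
    """Axis-aligned minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def mbrs_intersect(a, b):
    """True if two MBRs share at least one point (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# Hypothetical triangle given by its vertices.
triangle = [(1, 1), (4, 2), (2, 5)]
box = mbr(triangle)
```

Because an MBR test needs only four comparisons, it is a cheap first check before any exact geometric computation.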


MBR Examples


Quad Tree
Hierarchical decomposition of the space into quadrants (MBRs)
Each level in the tree represents the object as the set of quadrants which contain any portion of the object
Each lower level is a more exact representation of the object
The number of levels is determined by the degree of accuracy desired
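The decomposition described above can be sketched as a recursive split of the unit square. This assumes the object is a rectangle and that the space is `[0, 1] x [0, 1]`; the function names are hypothetical. At each depth the object is represented by the quadrants that contain some portion of it, and deeper levels cover it more tightly.

```python
def overlaps(a, b):
    """True if rectangles (xmin, ymin, xmax, ymax) overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def quadrants_covering(obj, max_depth, cell=(0.0, 0.0, 1.0, 1.0), depth=0):
    """Quadrants at level `max_depth` that contain any portion of `obj`."""
    if not overlaps(obj, cell):
        return []
    if depth == max_depth:
        return [cell]
    x0, y0, x1, y1 = cell
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    children = [(x0, y0, xm, ym), (xm, y0, x1, ym),
                (x0, ym, xm, y1), (xm, ym, x1, y1)]
    found = []
    for child in children:
        found.extend(quadrants_covering(obj, max_depth, child, depth + 1))
    return found

def covered_area(cells):
    """Total area of a list of quadrants."""
    return sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in cells)

# Hypothetical object: a small rectangle near the lower-left corner.
obj = (0.1, 0.1, 0.3, 0.3)
level1 = quadrants_covering(obj, 1)
level2 = quadrants_covering(obj, 2)
level3 = quadrants_covering(obj, 3)
```

For this object the covered area shrinks from level 2 to level 3, which is the sense in which a lower level is a more exact representation.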


Quad Tree Example


R-Tree
As with the Quad Tree, the region is divided into successively smaller rectangles (MBRs)
Rectangles need not be of the same size or number at each level
Rectangles may actually overlap
Lowest level cell has only one object
Tree maintenance algorithms are similar to those for B-trees


R-Tree Example


k-D Tree
Designed for multi-attribute data, not necessarily spatial
Variation of binary search tree
Each level is used to index one of the dimensions of the spatial object
Lowest level cell has only one object
Divisions not based on MBRs but on successive divisions of the dimension range
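A minimal sketch of the level-by-level dimension cycling described above. It uses a common k-d tree variant that stores one point at every node and splits at the median, which differs slightly from a layout where only the lowest-level cells hold objects; the dictionary node layout and function names are assumptions made for illustration.

```python
def build_kdtree(points, depth=0):
    """Build a k-d tree; level `depth` splits on dimension depth % k."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return {
        "point": ordered[mid],
        "axis": axis,
        "left": build_kdtree(ordered[:mid], depth + 1),
        "right": build_kdtree(ordered[mid + 1:], depth + 1),
    }

def contains(node, target):
    """Exact-match search that follows the splitting dimension at each level."""
    if node is None:
        return False
    if node["point"] == target:
        return True
    axis = node["axis"]
    t, v = target[axis], node["point"][axis]
    if t < v:
        return contains(node["left"], target)
    if t > v:
        return contains(node["right"], target)
    # Equal coordinate: a duplicate may sit on either side of the median.
    return contains(node["left"], target) or contains(node["right"], target)

# Hypothetical 2-D points.
pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
```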


k-D Tree Example


Topological Relationships
Disjoint
Overlaps or Intersects
Equals
Covered by or inside or contained in
Covers or contains


Distance Between Objects

Euclidean
Manhattan
Extensions:
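For reference, the two named point distances can be written out directly. This sketch covers only points of equal dimension; how the slide extends these distances to other cases is not recoverable from the text and is not shown. Function names are illustrative.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

d_euclid = euclidean((0, 0), (3, 4))
d_manhattan = manhattan((0, 0), (3, 4))
```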


Outline
Spatial Databases
What’s Spatial Data Mining?
Spatial Data Warehousing
Spatial Data Mining Methods
Summary
References


Spatial Data Mining (SDM)
The process of discovering
    interesting, useful, non-trivial patterns from large spatial datasets
Spatial patterns
    Spatial outliers, discontinuities
        bad traffic sensors on highways
    Location prediction models
        model to identify habitat of endangered species
    Spatial clusters
        crime hot-spots, cancer clusters
    Co-location patterns
        predator-prey species, symbiosis
        dental health and fluoride


Spatial Cluster: Example
The 1854 Asiatic cholera outbreak in London


Spatial Outliers: Example
Spatial Outliers
    Traffic Data in Twin Cities
    Abnormal Sensor Detections
    Spatial and Temporal Outliers


Predictive Models: Example
Location Prediction: Bird Habitat Prediction
    Given training data
    Predictive model building
    Predict new data


Co-locations: Example
Given: A collection of different types of spatial events
Find: Co-located subsets of event types


Data in Spatial Data Mining
Non-spatial Information
    Same as data in traditional data mining
    Numerical, categorical, ordinal, boolean, etc.
    e.g., city name, city population
Spatial Information
    Spatial attribute: geographically referenced
        Neighborhood and extent
        Location, e.g., longitude, latitude, elevation
    Spatial data representations
        Raster: gridded space
        Vector: point, line, polygon
        Graph: node, edge, path


Relationships on Data in Spatial Data Mining (1)

Relationships on non-spatial data
    Explicit
    Arithmetic, ranking (ordering), etc.
    Object is an instance of a class, class is a subclass of another class, object is part of another object, object is a member of a set



Relationships on Spatial Data
Many are implicit
Relationship Categories
    Set-oriented: union, intersection, membership, etc.
    Topological: meet, within, overlap, etc.
    Directional: North, NE, left, above, behind, etc.
    Metric: e.g., Euclidean distance, area, perimeter
    Dynamic: update, create, destroy, etc.
    Shape-based and visibility
Granularity



Granularity of Spatial Data
Examples of granularity


What’s NOT Spatial Data Mining
Simple querying of spatial data
    Find neighbors of Canada given names and boundaries of all countries
Testing a hypothesis via a primary data analysis
    Female chimpanzee territories are smaller than male territories
Uninteresting or obvious patterns in spatial data
    Heavy rainfall in Minneapolis is correlated with heavy rainfall in St. Paul, given that the two cities are 10 miles apart
Mining of non-spatial data
    Diaper sales and beer sales are correlated in the evening


SDM Applications
Geology
GIS Systems
Environmental Science
Agriculture
Medicine
Robotics
May involve both spatial and temporal aspects


Spatial Data Warehousing
Spatial data warehouse: integrated, subject-oriented, time-variant, and nonvolatile spatial data repository
Spatial data integration: a big issue
    Structure-specific formats (raster- vs. vector-based, OO vs. relational models, different storage and indexing, etc.)
    Vendor-specific formats (ESRI, MapInfo, Intergraph, IDRISI, etc.)
    Geo-specific formats (geographic vs. equal area projection, etc.)
Spatial data cube: multidimensional spatial database
    Both dimensions and measures may contain spatial components


Dimensions and Measures in Spatial Data Warehouse

Dimensions
    non-spatial
        e.g. “25-30 degrees” generalizes to “hot” (both are strings)
    spatial-to-nonspatial
        e.g. Seattle generalizes to the description “Pacific Northwest” (as a string)
    spatial-to-spatial
        e.g. Seattle generalizes to Pacific Northwest (as a spatial region)
Measures
    numerical (e.g. monthly revenue of a region)
        distributive (e.g. count, sum)
        algebraic (e.g. average)
        holistic (e.g. median, rank)
    spatial
        collection of spatial pointers (e.g. pointers to all regions with temperature of 25-30 degrees in July)
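The three kinds of numerical measure differ in how they can be combined from partial results, which matters when a cube cell is rolled up from sub-regions. A small sketch with made-up values for two hypothetical sub-regions:

```python
from statistics import median

# Hypothetical measurements in two sub-regions that roll up into one region.
part1 = [3, 1, 4]
part2 = [1, 5, 9, 2]

# Distributive: the result for the whole follows from the same function on the parts.
total = sum(part1) + sum(part2)
count = len(part1) + len(part2)

# Algebraic: derived from a fixed number of distributive results (sum and count).
average = total / count

# Holistic: no fixed-size summary of the parts is enough in general.
true_median = median(part1 + part2)
median_of_medians = median([median(part1), median(part2)])
```

Here combining the two sub-region medians does not reproduce the median of the merged data, so a holistic measure generally needs access to the underlying values.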


Spatial-to-Spatial Generalization

Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage
Requires the merging of a set of geographic areas by spatial operations

Dissolve

Merge

Clip

Intersect

Union


Example: British Columbia Weather Pattern Analysis

Input
    A map with about 3,000 weather probes scattered in B.C.
    Daily data for temperature, precipitation, wind velocity, etc.
    Data warehouse using star schema
Output
    A map that reveals patterns: merged (similar) regions
Goals
    Interactive analysis (drill-down, slice, dice, pivot, roll-up)
    Fast response time
    Minimizing storage space used
Challenge
    A merged region may contain hundreds of “primitive” regions (polygons)


Star Schema of the BC Weather Warehouse
Spatial data warehouse
    Dimensions
        region_name
        time
        temperature
        precipitation
    Measurements
        region_map
        area
        count
Fact table
Dimension table


Dynamic Merging of Spatial Objects

Materializing (precomputing) all? Too much storage space
On-line merge? Slow, expensive
Precompute rough approximations? Accuracy trade-off
A better way: object-based, selective (partial) materialization


Methods for Computing Spatial Data Cubes

On-line aggregation: collect and store pointers to spatial objects in a spatial data cube
    expensive and slow, need efficient aggregation techniques
Precompute and store all the possible combinations
    huge space overhead
Precompute and store rough approximations in a spatial data cube
    accuracy trade-off
Selective computation: only materialize those which will be accessed frequently
    a reasonable choice


Spatial Mining Tasks
Spatial correlation
Spatial regression
Spatial association
Spatial co-location
Spatial classification
Spatial clustering
Spatial outlier detection


Spatial Auto-correlation (SA)
First Law of Geography
    All things are related, but nearby things are more related than distant things
    Tobler [1970]
Examples
    People with similar backgrounds tend to live in the same area
    Economies of nearby regions tend to be similar
    Changes in temperature occur gradually over space (and time)


Spatial Correlation Measures
Spatial Autocorrelation
    Measures
        distance-based (e.g., K-function)
        neighbor-based (e.g., Moran’s I)
Spatial Cross-Correlation
    Measures
        distance-based, e.g., cross K-function


Moran’s I Measure
Definition
    $z = \{x_1 - \bar{x}, \ldots, x_n - \bar{x}\}$
    $x_i$: data values; $\bar{x}$: mean of $x$; $n$: number of data values
    $W$ is the row-normalized contiguity matrix
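The formula itself did not survive extraction. A commonly used form with a row-normalized $W$ is $MI = \frac{z W z^{T}}{z z^{T}}$, and the sketch below assumes that form. The grid helper, the rook (4-neighbor) contiguity rule, and all function names are illustrative assumptions.

```python
def grid_contiguity(rows, cols):
    """0/1 contiguity matrix for a grid, using 4-neighbor (rook) adjacency."""
    n = rows * cols
    adj = [[0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    adj[r * cols + c][rr * cols + cc] = 1
    return adj

def row_normalize(adj):
    """Scale each row so its weights sum to 1 (rows with no neighbors stay 0)."""
    out = []
    for row in adj:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out

def morans_i(x, w):
    """Moran's I as (z W z^T) / (z z^T), with z the mean-centered values."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    num = sum(z[i] * w[i][j] * z[j] for i in range(n) for j in range(n))
    den = sum(v * v for v in z)
    return num / den

w = row_normalize(grid_contiguity(4, 4))
checkerboard = [(r + c) % 2 for r in range(4) for c in range(4)]
two_halves = [1 if c >= 2 else 0 for r in range(4) for c in range(4)]
mi_checker = morans_i(checkerboard, w)
mi_halves = morans_i(two_halves, w)
```

On the checkerboard every neighbor has the opposite value, so the measure reaches -1; the two-halves pattern is clustered and gives a positive value.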


Moran’s I Measure
Ranges between -1 and +1
    higher positive value ⇒ high SA, cluster, attract
    lower negative value ⇒ interspersed, de-clustered, repel
    e.g., spatial randomness ⇒ MI = 0
    e.g., distribution of vegetation durability ⇒ MI = 0.7
    e.g., checker board ⇒ MI = -1


K-Function
K-function Definition:
    Test against randomness for point pattern
    $K(h) = \lambda^{-1} E[\text{number of events within distance } h \text{ of an arbitrary event}]$
        $\lambda$ is the intensity of events
    Model departure from randomness in a wide range of scales


K-Function: Example
For Poisson complete spatial randomness (CSR): $K(h) = \pi h^2$
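A naive empirical estimate follows the definition directly: replace the expectation with the average neighbor count and estimate the intensity as the number of points divided by the study area. This sketch applies no edge correction, which practical estimators usually add, and the function name and sample points are hypothetical.

```python
import math

def k_function(points, h, area):
    """Naive K(h): (mean number of other events within h) / intensity."""
    n = len(points)
    intensity = n / area  # estimated lambda: events per unit area
    counts = [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= h)
        for i, p in enumerate(points)
    ]
    return (sum(counts) / n) / intensity

# Hypothetical pattern in a 6 x 6 study area: a tight group plus one isolated event.
pts = [(0, 0), (1, 0), (0, 1), (5, 5)]
k_hat = k_function(pts, h=1.0, area=36.0)
csr_value = math.pi * 1.0 ** 2  # K(h) under complete spatial randomness
```

An estimate well above $\pi h^2$ at a given $h$ suggests clustering at that scale, and a value well below suggests regular spacing.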


Cross-Correlation
Cross K-Function Definition
    $K_{ij}(h) = \lambda^{-1} E[\text{number of type } j \text{ events within distance } h \text{ of a randomly chosen type } i \text{ event}]$
    Cross K-function of some pair of spatial feature types
Example
    Which pairs are frequently co-located?


Cross-Correlation: Example


Location Prediction
Given
    n spatial objects:
    d different features / maps:
    a dependent (target) class:
    a family of function mappings:
Find
    a classifier predicting the location of objects of the given classes which maximizes classification accuracy


Location Prediction: Example
Known nest locations
Task: predict other nest locations using the maps below


Location Prediction: Methods
Prediction
    Continuous: trend, e.g., regression
        Location aware: spatial autoregressive model (SAR)
    Discrete: classification, e.g., Bayesian classifier
        Location aware: Markov random fields (MRF)


Spatial Contextual Model: SAR
Spatial Autoregressive Model (SAR)
    $y = \rho W y + X\beta + \varepsilon$
    Assume that dependent values $y$ are related to each other: $y_i = f(y_j)$ for $i \neq j$
    Directly model spatial autocorrelation using $W$
Geographically Weighted Regression (GWR)
    A method of analyzing spatially varying relationships
        parameter estimates vary locally
    Models with Gaussian, logistic or Poisson forms can be fitted
    Example: $y = X\beta' + \varepsilon'$, where $\beta'$ and $\varepsilon'$ are location dependent
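To see what the SAR equation implies, one can generate $y$ from given $\rho$, $W$, $X\beta$ and $\varepsilon$. Because $y$ appears on both sides, the sketch below solves it by fixed-point iteration, which converges when $W$ is row-normalized and $|\rho| < 1$. This only evaluates the model for known parameters; it is not an estimation procedure, and the three-site example values are made up.

```python
def sar_solve(rho, w, xb, eps, iters=200):
    """Iterate y <- rho * W y + X beta + eps until it settles (|rho| < 1)."""
    n = len(xb)
    y = [0.0] * n
    for _ in range(iters):
        y = [rho * sum(w[i][j] * y[j] for j in range(n)) + xb[i] + eps[i]
             for i in range(n)]
    return y

# Hypothetical example: three sites in a line, row-normalized contiguity.
w = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
xb = [1.0, 2.0, 3.0]    # the X beta term, already multiplied out
eps = [0.0, 0.0, 0.0]   # noise switched off to make the result easy to check
y = sar_solve(0.5, w, xb, eps)
```

Each site's value is pulled toward its neighbors' values: without the $\rho W y$ term the result would simply be `xb`, while here it is `[3, 4, 5]` up to rounding.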


Spatial Contextual Model: MRF
Markov Random Fields Gaussian Mixture Model (MRF-GMM)
    Undirected graph to represent the interdependency relationship of random variables
    A variable depends only on its neighbors
    Independent of all other variables
    $f_C(s_i)$ is independent of $f_C(s_j)$ if $W(s_i, s_j) = 0$
    Predict $f_C(s_i)$, given feature value $X$ and neighborhood class label $C_N$
        Assume $\Pr(c_i)$, $\Pr(X, C_N \mid c_i)$ and $\Pr(X, C_N)$ are mixtures of Gaussian distributions


Spatial Association Rules
A spatial association rule is an association rule containing at least one spatial neighborhood relation
Spatial association rule: A ⇒ B [s%, c%]
    A and B are sets of spatial or non-spatial predicates
        Topological relations: intersects, overlaps, disjoint, etc.
        Spatial orientations: left_of, west_of, under, etc.
        Distance information: close_to, within_distance, etc.
    s% is the support and c% is the confidence of the rule


Spatial Association Rules Mining Methods

Examples
    is_a(x, large_town) ^ intersect(x, highway) => adjacent_to(x, water) [7%, 85%]
Two approaches
    Transaction-based approach
    Transaction-free approach


Transaction-Based Approach
Determine object type of interest (target object type)
Transform spatial database into set of transactions
    Transaction = one target object plus set of neighboring objects
    neighborhood definition is crucial
Apply (modified) algorithm for mining frequent itemsets
    e.g., Apriori algorithm
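The transformation step can be sketched as follows, assuming point objects and a fixed-radius neighborhood (one of many possible neighborhood definitions). For brevity the itemset counting enumerates every subset of each transaction rather than using Apriori's level-wise candidate pruning, so it suits only tiny examples. The data and all names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical spatial database: (object type, location).
objects = [
    ("town", (0, 0)), ("highway", (1, 0)), ("water", (0, 1)),
    ("town", (10, 10)), ("highway", (10, 11)),
    ("town", (20, 20)), ("water", (21, 20)), ("highway", (20, 21)),
]

def build_transactions(objs, target_type, radius):
    """One transaction per target object: the types of its nearby objects."""
    txns = []
    for kind, loc in objs:
        if kind != target_type:
            continue
        nearby = {k for k, other in objs
                  if k != target_type and math.dist(loc, other) <= radius}
        txns.append(frozenset(nearby))
    return txns

def frequent_itemsets(txns, min_support):
    """Support of every itemset meeting the threshold (brute force, not Apriori)."""
    counts = Counter()
    for t in txns:
        for size in range(1, len(t) + 1):
            for items in combinations(sorted(t), size):
                counts[items] += 1
    n = len(txns)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

txns = build_transactions(objects, "town", radius=1.5)
supports = frequent_itemsets(txns, min_support=0.6)
# Confidence of the rule near(highway) => near(water) for towns.
confidence = supports[("highway", "water")] / supports[("highway",)]
```

Changing `radius` changes which objects enter each transaction and therefore which rules appear, which is why the neighborhood definition is crucial.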


Progressive Refinement Mining of Spatial Association Rules

Hierarchy of spatial neighborhood relations
    “g_close_to” may be specialized to near_by, touch, intersect, contain, etc.
Basic idea: if two objects do not fulfill a rough relationship (such as intersect), they cannot fulfill a refined relationship (such as meet)
Two-step procedure for spatial neighborhood relations
    Step 1: rough spatial computation (as a filter)
        Using MBR or R-tree for rough estimation
    Step 2: detailed spatial algorithm (as refinement)
        Is very expensive (e.g. intersect test)
        Apply only to those objects which have passed the rough spatial association test (no less than min_support)
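The geometric part of the two-step procedure can be sketched as a filter-and-refine test. To keep the exact test short, objects here are circles (a stand-in for arbitrary polygons), and the support counting that the mining algorithm layers on top is omitted; all names and data are illustrative.

```python
import math

def circle_mbr(circle):
    """MBR of a circle given as ((cx, cy), radius)."""
    (x, y), r = circle
    return (x - r, y - r, x + r, y + r)

def mbrs_intersect(a, b):
    """Step 1 filter: cheap rectangle overlap test."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def circles_intersect(a, b):
    """Step 2 refinement: exact test on the real geometry."""
    (ca, ra), (cb, rb) = a, b
    return math.dist(ca, cb) <= ra + rb

def filter_refine(pairs):
    """Run the exact test only on pairs that pass the MBR filter."""
    candidates = [(a, b) for a, b in pairs
                  if mbrs_intersect(circle_mbr(a), circle_mbr(b))]
    confirmed = [(a, b) for a, b in candidates if circles_intersect(a, b)]
    return candidates, confirmed

c1 = ((0.0, 0.0), 1.0)
c2 = ((1.5, 1.5), 1.0)   # MBRs overlap, circles do not
c3 = ((1.0, 0.0), 1.0)   # circles overlap
c4 = ((5.0, 5.0), 1.0)   # rejected by the filter
candidates, confirmed = filter_refine([(c1, c2), (c1, c3), (c1, c4)])
```

The filter can produce false positives such as `(c1, c2)` but never false negatives, so the refinement step sees fewer pairs without losing any true result.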


Example


Transaction-Free Approach
Transaction-based approach requires target object type, which restricts set of rules discovered
Alternative approach: based on cliques of neighboring objects
    R-proximity neighborhoods
Database: set of spatial features of different types (e.g., A, B, C):
Example of R-proximity neighborhoods


Transaction-Free Approach
Co-location: set of feature types, e.g., {A,C} or {A,B,C}
Participation ratio of $f_i$ in $c$: proportion of instances of feature (type) $f_i$ participating in co-location $c$
    participation ratio of A in {A,B} = 2/3 = 0.67
    participation ratio of B in {A,B} = 2/2 = 1.0
Participation index: minimum participation ratio over all features $f_i$ in a co-location $c$
    participation index of {A,B} = min{0.67, 1.0} = 0.67
    Participation index is an upper bound of the cross-K function (Spatial Statistics)
    Participation index is monotonically decreasing with increasing co-location size
Goal: find all co-locations whose participation index is at least a given minimum threshold
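The ratios in the {A,B} example can be reproduced with a small sketch. It handles only co-locations of two feature types, treats two instances as neighbors when they lie within distance `r` of each other, and uses made-up coordinates chosen so that two of three A instances and both B instances participate.

```python
import math

# Hypothetical instances per feature type.
instances = {
    "A": [(0, 0), (5, 0), (20, 20)],
    "B": [(1, 0), (5, 1)],
}

def participation(inst, f1, f2, r):
    """Participation ratios of f1 and f2 in {f1, f2}, and the participation index."""
    pairs = [(p, q) for p in inst[f1] for q in inst[f2] if math.dist(p, q) <= r]
    ratio1 = len({p for p, _ in pairs}) / len(inst[f1])
    ratio2 = len({q for _, q in pairs}) / len(inst[f2])
    return ratio1, ratio2, min(ratio1, ratio2)

ratio_a, ratio_b, index_ab = participation(instances, "A", "B", r=1.5)
```

Here `ratio_a` is 2/3, `ratio_b` is 1.0, and the index is their minimum, matching the numbers on the slide.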


The Method
Alternatives for generating co-location candidates
    combinatorial join, geometric join, hybrid approach
Pruning of candidates using the participation index
Multi-resolution pruning
    Start with coarse resolution neighborhood definition
    Prune if coarse resolution participation falls below threshold
        anti-monotone because of spatial auto-correlation
    Decrease resolution of neighborhood definition


Spatial Cluster Analysis

Mining clusters: k-means, k-medoids, hierarchical, density-based, etc.
Analysis of distinct features of the clusters


Constraints-Based Clustering
Constraints on individual objects
    Simple selection of relevant objects before clustering
Clustering parameters as constraints
    K-means, density-based: radius, min-# of points
Constraints specified on clusters using SQL aggregates
    Sum of the profits in each cluster > $1 million
Constraints imposed by physical obstacles
    Clustering with obstructed distance


Constraint-Based Clustering: Planning ATM Locations

[Figure: spatial data with obstacles (mountain, river, bridge); clusters C1 to C4 obtained by clustering without taking obstacles into consideration]


Mining Spatiotemporal Data

Spatiotemporal data
    Data has spatial extensions and changes with time
    Ex: forest fires, moving objects, hurricanes and earthquakes
Automatic anomaly detection in massive moving objects
    Moving objects are ubiquitous: GPS, radar, etc.
    Ex: maritime vessel surveillance
    Problem: automatic anomaly detection


Analysis: Mining Anomaly in Moving Objects

Raw analysis of collected data does not fully convey “anomaly” information
More effective analysis relies on higher semantic features
Examples:
    A speed boat moving quickly in open water
    A fishing boat moving slowly into the docks
    A yacht circling slowly around a landmark during night hours


Framework: Motif-Based Feature Analysis

Motif-based representation
    A motif is a prototypical movement pattern
    View a movement path as a sequence of motif expressions
Motif-oriented feature space
    Automated motif feature extraction
    Semantic-level features
Classification
    Anomaly detection via classification
    High dimensional classifier


Movement Motifs
Prototypical movement of object
    Right-turn, U-turn
    Can be either defined by an expert or discovered automatically from data
        Defined in our framework
Extracted in movement paths
Path becomes a set of motif expressions


Motif Expression Attributes
Each motif expression has attributes (e.g., speed, location, size)
Attributes express how a motif was expressed
Conveys semantic information useful for classification
    A tight circle at 30mph near landmark Y
    A tight circle at 10mph in location X


Motif-Oriented Feature Space
Attributes describe how motifs are expressed
Let there be $A$ attributes; each path is a set of $(A+1)$-tuples
    $\{(m_i, v_1, v_2, \ldots, v_A), (m_j, v_1, v_2, \ldots, v_A)\}$
Naïve feature space construction
    1. Let each distinct $(m_j, v_1, v_2, \ldots, v_A)$ be a feature
    2. If a path exhibits a particular motif expression, its value is 1. Otherwise, its value is 0.
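The two construction steps can be sketched as a binary encoding of paths. The motif names, the choice of two attributes (speed and time), and the function name are hypothetical.

```python
def naive_features(paths):
    """Step 1: one feature per distinct motif expression. Step 2: 0/1 per path."""
    vocab = sorted({expr for path in paths for expr in path})
    index = {expr: i for i, expr in enumerate(vocab)}
    vectors = []
    for path in paths:
        row = [0] * len(vocab)
        for expr in path:
            row[index[expr]] = 1
        vectors.append(row)
    return vocab, vectors

# Each motif expression is (motif, speed, time): A = 2 attributes, so 3-tuples.
path1 = [("right_turn", 30, "10:01"), ("u_turn", 10, "10:05")]
path2 = [("right_turn", 30, "10:02")]
vocab, vectors = naive_features([path1, path2])
```

The two right turns differ only by one minute, yet they become separate features and the two paths share no non-zero feature, which illustrates how fine-grained values fragment the space.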


Analyzing Naïve Feature Space
Let there be $M$ distinct motifs and $V$ different possible values for each of the $A$ attributes
Size of feature space is
    $M \cdot V^{A}$
$V$ is usually very large due to high granularity of measurements
    E.g., seconds for time or meters for location
Modest values for $A$ and $M$ could lead to an extremely high dimensional feature space


More on Naïve Feature Space
High dimensional feature space could make effective learning hard
More importantly, highly granular features make generalization impossible!
    $(m_j, v_1, \text{10:01am}, \ldots, v_A)$ vs $(m_j, v_1, \text{10:02am}, \ldots, v_A)$
    Learning on one feature has no effect on another feature
Intuition: should have features that describe general high-level concepts
    “Early Morning” instead of 2:03am, 2:04am, …
    “Near Location X” instead of “50m west of Location X”
Solution: clustering on naïve feature space


Motif Feature Extraction
For each motif attribute, cluster values to form higher level concepts
Frequency and distribution in the learning data dictate the final clusters
Hierarchical micro-clustering
    Small clusters so concepts are not merged unnecessarily
    Hierarchy allows flexibility in describing objects
        For example: “afternoon” vs. “early afternoon” and “late afternoon”


Feature Clustering
Rough, fast micro-clustering method based on BIRCH (SIGMOD’96)
A micro-cluster is represented by a CF Vector: CF = (n, LS, SS)
Centroid and radius can be calculated from CF vector
CF Additive Theorem allows two CF Vectors to be combined quickly and losslessly
CF Tree is a hierarchy of CF Vectors
    A parent CF Vector holds information for all descendant CF Vectors
    Leaf CF Vector corresponds to a set of actual points
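A minimal sketch of the CF vector operations named above, following the usual BIRCH definitions: `n` is the number of points, `LS` the per-dimension linear sum, and `SS` the sum of squared norms (stored here as a scalar). The tuple layout and function names are illustrative.

```python
import math

def cf_from_points(points):
    """CF = (n, LS, SS) summarizing a set of points."""
    n = len(points)
    dim = len(points[0])
    ls = tuple(sum(p[d] for p in points) for d in range(dim))
    ss = sum(v * v for p in points for v in p)
    return (n, ls, ss)

def cf_merge(a, b):
    """CF additivity: the CF of a union is the component-wise sum."""
    return (a[0] + b[0], tuple(x + y for x, y in zip(a[1], b[1])), a[2] + b[2])

def cf_centroid(cf):
    """Centroid = LS / n."""
    n, ls, _ = cf
    return tuple(v / n for v in ls)

def cf_radius(cf):
    """Root mean squared distance to the centroid: sqrt(SS/n - |LS/n|^2)."""
    n, ls, ss = cf
    centroid_sq = sum((v / n) ** 2 for v in ls)
    return math.sqrt(max(ss / n - centroid_sq, 0.0))

group1 = [(0, 0), (2, 0)]
group2 = [(0, 2), (2, 2)]
merged = cf_merge(cf_from_points(group1), cf_from_points(group2))
```

Merging never revisits the original points, which is what lets a parent entry in the CF tree summarize all of its descendants.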


More on Feature Clustering
Build CF Tree from raw data, much like a B-tree
Two parameters in clustering
    B: branching factor of CF Tree
    T: radius threshold of CF Vector
Parameters control how finely micro-clusters are constructed
Hierarchical agglomerative clustering on leaves of CF Tree
Entire process is efficient: O(N)


Extracted Feature Space
Leaf nodes in final clustering become the new features
More general than the original naïve feature space
Dimensionality could still be moderately high
Use Support Vector Machine for classification


Experiments
Synthetic Data
    Generated at motif-expression level
    Abnormal paths are injected with abnormal motif-expressions
Classifiers
    SVM using naïve feature space
    SVM using extracted feature spaces of varying refinement levels


Experiment


Experiment (2)


Summary: Moving Object Anomaly Detection

Higher level semantic analysis of moving objects yields better results
Automated feature extraction
Future work
    Automatic determination of t parameter
    Better use of feature space hierarchy
    Other analysis, such as clustering and local outlier detection for anomaly detection
    Mining other knowledge for moving objects


Outline
Spatial Databases
Spatial Data Mining
Spatial Data Warehousing
Spatial Data Mining Methods
Summary
References


Summary (1)
What’s Special About Spatial Data Mining?
    Input Data
    Statistical Foundation
    Output Patterns
    Computational Process


Summary (2)


References (1)
J. Roddick, K. Hornsby, and M. Spiliopoulou, Yet Another Bibliography of Temporal, Spatial and Spatio-temporal Data Mining Research, KDD Workshop, 2001
S. Shekhar, C. T. Lu, and P. Zhang, A Unified Approach to Detecting Spatial Outliers, GeoInformatica, 7(2), Kluwer Academic Publishers, 2003
S. Shekhar and S. Chawla, Spatial Databases: A Tour, Prentice Hall, 2003
S. Shekhar, P. Schrater, R. Vatsavai, W. Wu, and S. Chawla, Spatial Contextual Classification and Prediction Models for Mining Geospatial Data, IEEE Transactions on Multimedia (special issue on Multimedia Databases), 2002


References (2)
S. Shekhar and Y. Huang, Discovering Spatial Co-location Patterns: A Summary of Results, SSTD, 2001
A. Fotheringham, C. Brunsdon, and M. Charlton, Geographically Weighted Regression: The Analysis of Spatially Varying Relationships, John Wiley & Sons, 2002
P. Tan, M. Steinbach, V. Kumar, C. Potter, S. Klooster, and A. Torregrosa, Finding Spatio-Temporal Patterns in Earth Science Data, KDD Workshop on Temporal Data Mining, 2001
P. Zhang, Y. Huang, S. Shekhar, and V. Kumar, Exploiting Spatial Autocorrelation to Efficiently Process Correlation-Based Similarity Queries, SSTD, 2003
