Lecture 9 Spatial Data Mining - Fudan University (admis.fudan.edu.cn/member/sgzhou/courses/data...)
Data Mining: Tech. & Appl.
Lecture 9: Spatial Data Mining
Zhou Shuigeng
May 27, 2007

Outline
• Spatial Databases
• Spatial Data Mining
• Spatial Data Warehousing
• Spatial Data Mining Methods
• Summary
• References

Spatial Data
• Spatial data has location or geo-referenced features
• Some of these features are:
  • Address, latitude/longitude (explicit)
  • Location-based partitions in databases (implicit)

Spatial Databases
• Spatial Database Systems (SDBS)
  • database systems supporting spatial datatypes in data model and implementation
  • objects with location and extension in a multi-dimensional space

Spatial Data Format
• Raster Data
  • represents spatial data as rows / columns of pixels (volume representation)
  • obtained from equipment such as earth observation satellites, which measure the emitted / reflected amplitude in some frequency band
• Vector Data
  • represents spatial data by their boundary (boundary representation)
  • points, lines, polygons, polyhedrons, etc.
  • often obtained from raster data using image processing methods

Spatial Queries (1)
• Spatial selection may involve specialized selection comparison operations:
  • Near
  • North, South, East, West
  • Contained in
  • Overlap/intersect
• Region (Range) query: find objects that intersect a given region
• Nearest neighbor query: find the object closest to an identified object
• Distance scan: find objects within a certain distance of an identified object, where the distance is made increasingly larger

Spatial Queries (2)
[Figure not recovered]

Spatial Queries (3)
[Figure not recovered]

Spatial Data Structures
• Data structures designed specifically to store or index spatial data
• Often based on B-tree or Binary Search Tree
• Cluster data on disk based on geographic location
• May represent complex spatial structure by placing the spatial object in a containing structure of a specific geographic shape
• Techniques:
  • Quad Tree
  • R-Tree
  • k-D Tree

MBR
• Minimum Bounding Rectangle
• Smallest rectangle that completely contains the object
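A minimal sketch of the MBR idea, assuming 2-D objects given as lists of vertices and axis-aligned rectangles. The names `mbr` and `mbrs_intersect` and the sample shapes are illustrative and do not come from the lecture.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def mbr(points: Iterable[Point]) -> Rect:
    """Smallest axis-aligned rectangle containing every vertex."""
    pts = list(points)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_intersect(a: Rect, b: Rect) -> bool:
    """True if two MBRs share at least one point (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical objects used only for illustration.
triangle = [(1.0, 1.0), (4.0, 2.0), (2.0, 5.0)]
square = [(3.0, 3.0), (6.0, 3.0), (6.0, 6.0), (3.0, 6.0)]
far_away = [(10.0, 10.0), (11.0, 12.0)]

print(mbr(triangle))
print(mbrs_intersect(mbr(triangle), mbr(square)))
print(mbrs_intersect(mbr(triangle), mbr(far_away)))
```

Overlapping MBRs do not guarantee that the objects themselves intersect; the MBR test is only a cheap first check before an exact geometric test.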

MBR Examples
[Figure not recovered]

Quad Tree
• Hierarchical decomposition of the space into quadrants (MBRs)
• Each level in the tree represents the object as the set of quadrants which contain any portion of the object
• Each lower level is a more exact representation of the object
• The number of levels is determined by the degree of accuracy desired

Quad Tree Example
[Figure not recovered]

R-Tree
• As with Quad Tree, the region is divided into successively smaller rectangles (MBRs)
• Rectangles need not be of the same size or number at each level
• Rectangles may actually overlap
• Lowest level cell has only one object
• Tree maintenance algorithms similar to those for B-trees

R-Tree Example
[Figure not recovered]

k-D Tree
• Designed for multi-attribute data, not necessarily spatial
• Variation of binary search tree
• Each level is used to index one of the dimensions of the spatial object
• Lowest level cell has only one object
• Divisions not based on MBRs but on successive divisions of the dimension range
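The following is a small sketch of a k-D tree, assuming the common variant that stores one point at every node and splits on the median (the slide describes objects held in the lowest-level cells, so this is an illustration rather than the lecture's exact structure). `KDNode`, `build_kdtree`, `range_query` and the sample points are hypothetical names and data.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


class KDNode:
    def __init__(self, point: Point, axis: int,
                 left: "Optional[KDNode]", right: "Optional[KDNode]") -> None:
        self.point, self.axis, self.left, self.right = point, axis, left, right


def build_kdtree(points: List[Point], depth: int = 0) -> Optional[KDNode]:
    """Split on the median of one dimension per level, cycling through dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return KDNode(ordered[mid], axis,
                  build_kdtree(ordered[:mid], depth + 1),
                  build_kdtree(ordered[mid + 1:], depth + 1))


def range_query(node: Optional[KDNode], lo: Point, hi: Point) -> List[Point]:
    """Return all points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if node is None:
        return []
    found = []
    if all(l <= c <= h for l, c, h in zip(lo, node.point, hi)):
        found.append(node.point)
    # Only descend into a side whose half-space can overlap the query box.
    if lo[node.axis] <= node.point[node.axis]:
        found += range_query(node.left, lo, hi)
    if node.point[node.axis] <= hi[node.axis]:
        found += range_query(node.right, lo, hi)
    return found


pts = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(pts)
print(sorted(range_query(tree, (3.0, 1.0), (8.0, 5.0))))
```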

k-D Tree Example
[Figure not recovered]

Topological Relationships
• Disjoint
• Overlaps or Intersects
• Equals
• Covered by or inside or contained in
• Covers or contains

Distance Between Objects
• Euclidean
• Manhattan
• Extensions:

Spatial Data Mining (SDM)
• The process of discovering interesting, useful, non-trivial patterns from large spatial datasets
• Spatial patterns
  • Spatial outliers, discontinuities
    • bad traffic sensors on highways
  • Location prediction models
    • model to identify habitat of endangered species
  • Spatial clusters
    • crime hot-spots, cancer clusters
  • Co-location patterns
    • predator-prey species, symbiosis
    • dental health and fluoride

Spatial Cluster: Example
• The 1854 Asiatic cholera outbreak in London

Spatial Outliers: Example
• Spatial Outliers
  • Traffic Data in Twin Cities
  • Abnormal Sensor Detections
  • Spatial and Temporal Outliers

Predictive Models: Example
• Location Prediction: Bird Habitat Prediction
  • Given training data
  • Predictive model building
  • Predict new data

Co-locations: Example
• Given: A collection of different types of spatial events
• Find: Co-located subsets of event types

Data in Spatial Data Mining
• Non-spatial Information
  • Same as data in traditional data mining
  • Numerical, categorical, ordinal, boolean, etc.
  • e.g., city name, city population
• Spatial Information
  • Spatial attribute: geographically referenced
    • Neighborhood and extent
    • Location, e.g., longitude, latitude, elevation
  • Spatial data representations
    • Raster: gridded space
    • Vector: point, line, polygon
    • Graph: node, edge, path

Relationships on Data in Spatial Data Mining (1)
• Relationships on non-spatial data
  • Explicit
  • Arithmetic, ranking (ordering), etc.
  • Object is an instance of a class, class is a subclass of another class, object is part of another object, object is a member of a set

Relationships on Data in Spatial Data Mining (2)
• Relationships on Spatial Data
  • Many are implicit
  • Relationship Categories
    • Set-oriented: union, intersection, membership, etc.
    • Topological: meet, within, overlap, etc.
    • Directional: North, NE, left, above, behind, etc.
    • Metric: e.g., Euclidean: distance, area, perimeter
    • Dynamic: update, create, destroy, etc.
    • Shape-based and visibility
  • Granularity

Relationships on Data in Spatial Data Mining (3)
• Granularity of Spatial Data
• Examples of granularity
[Table not recovered]

What's NOT Spatial Data Mining
• Simple Querying of Spatial Data
  • Find neighbors of Canada given names and boundaries of all countries
• Testing a hypothesis via a primary data analysis
  • Female chimpanzee territories are smaller than male territories
• Uninteresting or obvious patterns in spatial data
  • Heavy rainfall in Minneapolis is correlated with heavy rainfall in St. Paul, given that the two cities are 10 miles apart
• Mining of non-spatial data
  • Diaper sales and beer sales are correlated in the evening

SDM Applications
• Geology
• GIS Systems
• Environmental Science
• Agriculture
• Medicine
• Robotics
• May involve both spatial and temporal aspects

Spatial Data Warehousing
• Spatial data warehouse: integrated, subject-oriented, time-variant, and nonvolatile spatial data repository
• Spatial data integration: a big issue
  • Structure-specific formats (raster- vs. vector-based, OO vs. relational models, different storage and indexing, etc.)
  • Vendor-specific formats (ESRI, MapInfo, Intergraph, IDRISI, etc.)
  • Geo-specific formats (geographic vs. equal area projection, etc.)
• Spatial data cube: multidimensional spatial database
  • Both dimensions and measures may contain spatial components

Dimensions and Measures in Spatial Data Warehouse
• Dimensions
  • non-spatial
    • e.g. "25-30 degrees" generalizes to "hot" (both are strings)
  • spatial-to-nonspatial
    • e.g. Seattle generalizes to description "Pacific Northwest" (as a string)
  • spatial-to-spatial
    • e.g. Seattle generalizes to Pacific Northwest (as a spatial region)
• Measures
  • numerical (e.g. monthly revenue of a region)
    • distributive (e.g. count, sum)
    • algebraic (e.g. average)
    • holistic (e.g. median, rank)
  • spatial
    • collection of spatial pointers (e.g. pointers to all regions with temperature of 25-30 degrees in July)
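To illustrate the three kinds of numerical measures, here is a short sketch using made-up revenue figures for two sub-regions that roll up into one region. All names and numbers are hypothetical. Sum and count can be merged from per-cell partial results (distributive), the average can be derived from a fixed number of such partials (algebraic), while the median generally cannot be rebuilt from per-cell medians (holistic).

```python
from statistics import median
from typing import List, Tuple

# Hypothetical monthly revenues for two sub-regions being rolled up into one region.
north = [10.0, 20.0, 30.0]
south = [40.0, 50.0]


def partial(values: List[float]) -> Tuple[float, int]:
    """Sub-aggregate kept per cell: (sum, count)."""
    return (sum(values), len(values))


def merge(a: Tuple[float, int], b: Tuple[float, int]) -> Tuple[float, int]:
    """Distributive step: sums add up and counts add up."""
    return (a[0] + b[0], a[1] + b[1])


total, count = merge(partial(north), partial(south))
average = total / count  # algebraic: derived from two distributive values

true_median = median(north + south)  # needs all the raw values
median_of_medians = median([median(north), median(south)])  # not the same thing

print(total, count, average, true_median, median_of_medians)
```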

Spatial-to-Spatial Generalization
• Generalize detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage
• Requires the merging of a set of geographic areas by spatial operations
  • Dissolve
  • Merge
  • Clip
  • Intersect
  • Union

Example: British Columbia Weather Pattern Analysis
• Input
  • A map with about 3,000 weather probes scattered in B.C.
  • Daily data for temperature, precipitation, wind velocity, etc.
  • Data warehouse using star schema
• Output
  • A map that reveals patterns: merged (similar) regions
• Goals
  • Interactive analysis (drill-down, slice, dice, pivot, roll-up)
  • Fast response time
  • Minimizing storage space used
• Challenge
  • A merged region may contain hundreds of "primitive" regions (polygons)

Star Schema of the BC Weather Warehouse
• Spatial data warehouse
  • Dimensions
    • region_name
    • time
    • temperature
    • precipitation
  • Measurements
    • region_map
    • area
    • count
[Figure: star schema with a fact table and dimension tables]

Dynamic Merging of Spatial Objects
• Materializing (precomputing) all? Too much storage space
• On-line merge? Slow, expensive
• Precompute rough approximations? Accuracy trade-off
• A better way: object-based, selective (partial) materialization

Methods for Computing Spatial Data Cubes
• On-line aggregation: collect and store pointers to spatial objects in a spatial data cube
  • expensive and slow, need efficient aggregation techniques
• Precompute and store all the possible combinations
  • huge space overhead
• Precompute and store rough approximations in a spatial data cube
  • accuracy trade-off
• Selective computation: only materialize those which will be accessed frequently
  • a reasonable choice

Spatial Mining Tasks
• Spatial correlation
• Spatial regression
• Spatial association
• Spatial co-location
• Spatial classification
• Spatial clustering
• Spatial outlier detection

Spatial Auto-correlation (SA)
• First Law of Geography
  • All things are related, but nearby things are more related than distant things
  • Tobler [1970]
• Examples
  • People with similar backgrounds tend to live in the same area
  • Economies of nearby regions tend to be similar
  • Changes in temperature occur gradually over space (and time)

Spatial Correlation Measures
• Spatial Autocorrelation Measures
  • distance-based (e.g., K-function)
  • neighbor-based (e.g., Moran's I)
• Spatial Cross-Correlation Measures
  • distance-based, e.g., cross K-function

Moran's I Measure
• Definition
  • z = {x1 − x̄, ..., xn − x̄}
  • xi: data values; x̄: mean of x; n: number of data values
  • W is the row-normalized contiguity matrix
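The formula itself did not survive in this transcript. A minimal sketch, assuming the usual matrix form for a row-normalized W, MI = z W zᵀ / (z zᵀ), is given below. The function name `morans_i`, the neighbor-list input format and the four-site chain are illustrative assumptions; every site is assumed to have at least one neighbor and the values are assumed not to be all equal.

```python
from typing import Dict, List


def morans_i(x: List[float], neighbors: Dict[int, List[int]]) -> float:
    """Moran's I as z W z^T / (z z^T), with W the row-normalized contiguity matrix."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    numerator = 0.0
    for i in range(n):
        nbrs = neighbors[i]
        # Row normalization: each neighbor of site i gets weight 1 / len(nbrs).
        lagged = sum(z[j] for j in nbrs) / len(nbrs)
        numerator += z[i] * lagged
    denominator = sum(v * v for v in z)
    return numerator / denominator


# Hypothetical 1-D chain of four sites: 0 - 1 - 2 - 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
alternating = morans_i([1.0, -1.0, 1.0, -1.0], chain)  # checkerboard-like pattern
blocked = morans_i([1.0, 1.0, -1.0, -1.0], chain)      # similar values side by side

print(alternating, blocked)
```

On this toy chain the alternating pattern gives a strongly negative value and the blocked pattern a positive one, which matches the interpretation of the measure's range.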

Moran's I Measure
• Ranges between -1 and +1
  • higher positive value ⇒ high SA, cluster, attract
  • lower negative value ⇒ interspersed, de-clustered, repel
  • e.g., spatial randomness ⇒ MI = 0
  • e.g., distribution of vegetation durability ⇒ MI = 0.7
  • e.g., checker board ⇒ MI = -1

K-Function
• K-function Definition:
  • Test against randomness for point pattern
  • K(h) = λ^(-1) E[number of events within distance h of an arbitrary event]
    • λ is the intensity of events
  • Model departure from randomness in a wide range of scales

K-Function: Example
• For Poisson complete spatial randomness (CSR): K(h) = πh^2
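A rough sketch of how K(h) could be estimated from a point pattern and compared with the CSR value πh^2. It assumes a naive estimator with no edge correction and an intensity estimated as n / area; the function names and the four sample events are hypothetical.

```python
import math
from typing import List, Tuple


def k_function(points: List[Tuple[float, float]], area: float, h: float) -> float:
    """Naive estimate of K(h): (1 / lambda) * mean number of other events within h."""
    n = len(points)
    intensity = n / area  # lambda, events per unit area
    close_pairs = 0
    for i, (xi, yi) in enumerate(points):
        for j, (xj, yj) in enumerate(points):
            if i != j and math.hypot(xi - xj, yi - yj) <= h:
                close_pairs += 1
    return close_pairs / n / intensity


def k_csr(h: float) -> float:
    """Theoretical K(h) under complete spatial randomness."""
    return math.pi * h * h


# Hypothetical events in a 10 x 10 study area: three close together, one far away.
events = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
k_hat = k_function(events, area=100.0, h=1.0)

print(k_hat, k_csr(1.0))
```

An estimate well above πh^2 at a given h suggests clustering at that scale, and one well below suggests regularity; with only four points this is purely illustrative.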

Cross-Correlation
• Cross K-Function Definition
  • Kij(h) = λj^(-1) E[number of type j events within distance h of a randomly chosen type i event]
  • Cross K-function of some pair of spatial feature types
• Example
  • Which pairs are frequently co-located?

Cross-Correlation: Example
[Figures not recovered]

Location Prediction
• Given
  • n spatial objects:
  • d different features / maps:
  • a dependent (target) class:
  • a family of function mappings:
• Find
  • a classifier predicting the location of objects of the given classes which maximizes classification accuracy

Location Prediction: Example
• Known nest locations
• Task: predict other nest locations using the maps below
[Figure not recovered]

Location Prediction: Methods
• Prediction
  • Continuous: trend, e.g., regression
    • Location aware: spatial autoregressive model (SAR)
  • Discrete: classification, e.g., Bayesian classifier
    • Location aware: Markov random fields (MRF)

Spatial Contextual Model: SAR
• Spatial Autoregressive Model (SAR)
  • y = ρWy + Xβ + ε
  • Assume that dependent values y are related to each other: yi = f(yj) for i ≠ j
  • Directly model spatial autocorrelation using W
• Geographically Weighted Regression (GWR)
  • A method of analyzing spatially varying relationships
    • parameter estimates vary locally
  • Models with Gaussian, logistic or Poisson forms can be fitted
  • Example: y = Xβ′ + ε′, where β′ and ε′ are location dependent
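As a worked step for the SAR equation above, and assuming the matrix I − ρW is invertible (an assumption not stated on the slide), the model can be solved for y:

```latex
\begin{aligned}
y &= \rho W y + X\beta + \varepsilon \\
(I - \rho W)\, y &= X\beta + \varepsilon \\
y &= (I - \rho W)^{-1} (X\beta + \varepsilon)
\end{aligned}
```

Setting ρ = 0 removes the neighborhood term and recovers the classical regression y = Xβ + ε, so ρ can be read as the strength of the spatial dependence carried by W.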

Spatial Contextual Model: MRF
• Markov Random Fields Gaussian Mixture Model (MRF-GMM)
  • Undirected graph to represent the interdependency relationship of random variables
  • A variable depends only on neighbors
  • Independent of all other variables
  • fC(Si) independent of fC(Sj) if W(si, sj) = 0
  • Predict fC(Si), given feature value X and neighborhood class label CN
  • Assume Pr(ci), Pr(X, CN | ci) and Pr(X, CN) are mixtures of Gaussian distributions

Spatial Association Rules
• A spatial association rule is an association rule containing at least one spatial neighborhood relation
• Spatial association rule: A ⇒ B [s%, c%]
  • A and B are sets of spatial or non-spatial predicates
    • Topological relations: intersects, overlaps, disjoint, etc.
    • Spatial orientations: left_of, west_of, under, etc.
    • Distance information: close_to, within_distance, etc.
  • s% is the support and c% is the confidence of the rule

Spatial Association Rules Mining Methods
• Example
  • is_a(x, large_town) ^ intersect(x, highway) => adjacent_to(x, water) [7%, 85%]
• Two approaches
  • Transaction-based approach
  • Transaction-free approach

Transaction-Based Approach
• Determine object type of interest (target object type)
• Transform spatial database into set of transactions
  • Transaction = one target object plus set of neighboring objects
  • neighborhood definition is crucial
• Apply (modified) algorithm for mining frequent itemsets
  • e.g., Apriori algorithm
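The transformation step can be sketched as follows, assuming point objects, a Euclidean distance threshold as the neighborhood definition, and towns as the target object type. The data, the radius and the function names are all hypothetical; a real system would feed the resulting transactions to a frequent-itemset miner such as Apriori.

```python
import math
from typing import List, Set, Tuple

# Hypothetical objects: (type, x, y). Towns are the target object type.
objects: List[Tuple[str, float, float]] = [
    ("town", 0.0, 0.0), ("town", 10.0, 0.0), ("town", 20.0, 0.0),
    ("water", 1.0, 0.0), ("highway", 0.0, 1.0),
    ("water", 10.0, 1.0), ("highway", 20.0, 1.0),
]


def to_transactions(objs: List[Tuple[str, float, float]],
                    target: str, radius: float) -> List[Set[str]]:
    """One transaction per target object: the types of all objects within `radius`."""
    result = []
    for t, x, y in objs:
        if t != target:
            continue
        items = {u for u, a, b in objs
                 if u != target and math.hypot(a - x, b - y) <= radius}
        result.append(items)
    return result


def support(transactions: List[Set[str]], itemset: Set[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


transactions = to_transactions(objects, "town", radius=1.5)
print(transactions)
print(support(transactions, {"water"}))
```

Changing `radius` changes which transactions are produced, which is why the slide stresses that the neighborhood definition is crucial.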

Progressive Refinement Mining of Spatial Association Rules
• Hierarchy of spatial neighborhood relations
  • "g_close_to" may be specialized to near_by, touch, intersect, contain, etc.
• Basic idea: if two objects do not fulfill a rough relationship (such as intersect), they cannot fulfill a refined relationship (such as meet)
• Two-step procedure for spatial neighborhood relations
  • Step 1: rough spatial computation (as a filter)
    • Using MBR or R-tree for rough estimation
  • Step 2: detailed spatial algorithm (as refinement)
    • Is very expensive (e.g. intersect test)
    • Apply only to those objects which have passed the rough spatial association test (no less than min_support)

Example
[Figures not recovered]

Transaction-Free Approach
• Transaction-based approach requires a target object type, which restricts the set of rules discovered
• Alternative approach: based on cliques of neighboring objects
  • R-proximity neighborhoods
• Database: set of spatial features of different types (e.g., A, B, C)
• Example of R-proximity neighborhoods
[Figure not recovered]

Transaction-Free Approach
• Co-location: set of feature types, e.g., {A,C} or {A,B,C}
• Participation ratio of fi in c: proportion of instances of feature (type) fi participating in co-location c
  • participation ratio of A in {A,B} = 2/3 = 0.67
  • participation ratio of B in {A,B} = 2/2 = 1.0
• Participation index: minimum participation ratio over all features fi in a co-location c
  • participation index of {A,B} = min{0.67, 1.0} = 0.67
  • Participation index is an upper bound of the cross-K function (Spatial Statistics)
  • Participation index is monotonically decreasing with increasing co-location size
• Goal: find all co-locations whose participation index reaches a given minimum threshold
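The figure behind the numbers above is not available, so the sketch below uses a hypothetical dataset chosen only to reproduce the quoted ratios: three instances of A, two of B, and two neighboring (A, B) pairs. The instance names and helper functions are illustrative.

```python
from typing import Dict, List, Tuple

# Hypothetical data chosen to reproduce the ratios quoted above.
instances: Dict[str, List[str]] = {"A": ["A1", "A2", "A3"], "B": ["B1", "B2"]}
ab_rows: List[Tuple[str, str]] = [("A1", "B1"), ("A2", "B2")]  # row instances of {A, B}


def participation_ratios(features: List[str],
                         rows: List[Tuple[str, ...]]) -> Dict[str, float]:
    """Share of each feature's instances that appear in at least one row instance."""
    ratios = {}
    for pos, f in enumerate(features):
        participating = {row[pos] for row in rows}
        ratios[f] = len(participating) / len(instances[f])
    return ratios


def participation_index(features: List[str], rows: List[Tuple[str, ...]]) -> float:
    """Minimum participation ratio over the features of the co-location."""
    return min(participation_ratios(features, rows).values())


ratios = participation_ratios(["A", "B"], ab_rows)
pi_ab = participation_index(["A", "B"], ab_rows)
print(ratios, pi_ab)
```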

The Method
• Alternatives for generating co-location candidates
  • combinatorial join, geometric join, hybrid approach
• Pruning of candidates using the participation index
• Multi-resolution pruning
  • Start with coarse resolution neighborhood definition
  • Prune if coarse resolution participation falls below threshold
    • anti-monotone because of spatial auto-correlation
  • Decrease resolution of neighborhood definition

Spatial Cluster Analysis
• Mining clusters: k-means, k-medoids, hierarchical, density-based, etc.
• Analysis of distinct features of the clusters

Constraints-Based Clustering
• Constraints on individual objects
  • Simple selection of relevant objects before clustering
• Clustering parameters as constraints
  • K-means, density-based: radius, min-# of points
• Constraints specified on clusters using SQL aggregates
  • Sum of the profits in each cluster > $1 million
• Constraints imposed by physical obstacles
  • Clustering with obstructed distance

Constraint-Based Clustering: Planning ATM Locations
[Figure: spatial data with obstacles (mountain, river, bridge), and clusters C1 to C4 obtained by clustering without taking obstacles into consideration]

Mining Spatiotemporal Data
• Spatiotemporal data
  • Data has spatial extensions and changes with time
  • Ex: forest fires, moving objects, hurricanes and earthquakes
• Automatic anomaly detection in massive moving objects
  • Moving objects are ubiquitous: GPS, radar, etc.
  • Ex: maritime vessel surveillance
  • Problem: automatic anomaly detection

Analysis: Mining Anomaly in Moving Objects
• Raw analysis of collected data does not fully convey "anomaly" information
• More effective analysis relies on higher semantic features
• Examples:
  • A speed boat moving quickly in open water
  • A fishing boat moving slowly into the docks
  • A yacht circling slowly around a landmark during night hours

Framework: Motif-Based Feature Analysis
• Motif-based representation
  • A motif is a prototypical movement pattern
  • View a movement path as a sequence of motif expressions
• Motif-oriented feature space
  • Automated motif feature extraction
  • Semantic-level features
• Classification
  • Anomaly detection via classification
  • High dimensional classifier

Movement Motifs
• Prototypical movement of object
  • Right-turn, U-turn
• Can be either defined by an expert or discovered automatically from data
  • Defined in our framework
• Extracted in movement paths
• Path becomes a set of motif expressions

Motif Expression Attributes
• Each motif expression has attributes (e.g., speed, location, size)
• Attributes express how a motif was expressed
• Conveys semantic information useful for classification
  • A tight circle at 30mph near landmark Y
  • A tight circle at 10mph in location X

Motif-Oriented Feature Space
• Attributes describe how motifs are expressed
• Let there be A attributes; each path is a set of (A+1)-tuples
  • {(mi, v1, v2, ..., vA), (mj, v1, v2, ..., vA)}
• Naïve feature space construction
  1. Let each distinct (mj, v1, v2, ..., vA) be a feature
  2. If a path exhibits a particular motif-expression, its value is 1. Otherwise, its value is 0.

Analyzing Naïve Feature Space
• Let there be M distinct motifs and V different possible values for each of the A attributes
• Size of feature space is M * V^A
• V is usually very large due to high granularity of measurements
  • E.g., seconds for time or meters for location
• Modest values for A and M could lead to an extremely high dimensional feature space
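A quick arithmetic check of M * V^A with made-up sizes (10 motifs, 3 attributes, 1,000 distinct values per attribute); the numbers and the function name are illustrative only.

```python
def naive_feature_count(num_motifs: int, values_per_attribute: int,
                        num_attributes: int) -> int:
    """Number of distinct (motif, v1, ..., vA) tuples: M * V**A."""
    return num_motifs * values_per_attribute ** num_attributes


# Hypothetical sizes: M = 10 motifs, V = 1,000 values, A = 3 attributes.
size = naive_feature_count(10, 1000, 3)
print(f"{size:,}")
```

Even with these modest choices the naive space has ten billion binary features, far more than any realistic number of training paths.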

More on Naïve Feature Space
• High dimensional feature space could make effective learning hard
• More importantly, highly granular features make generalization impossible!
  • (mj, v1, 10:01am, ..., vA) vs (mj, v1, 10:02am, ..., vA)
  • Learning on one feature has no effect on another feature
• Intuition: should have features that describe general high-level concepts
  • "Early Morning" instead of 2:03am, 2:04am, ...
  • "Near Location X" instead of "50m west of Location X"
• Solution: clustering on naïve feature space

Motif Feature Extraction
• For each motif attribute, cluster values to form higher level concepts
• Frequency and distribution in the learning data dictate the final clusters
• Hierarchical micro-clustering
  • Small clusters so concepts are not merged unnecessarily
  • Hierarchy allows flexibility in describing objects
    • For example: "afternoon" vs. "early afternoon" and "late afternoon"

Feature Clustering
• Rough, fast micro-clustering method based on BIRCH (SIGMOD'96)
• A micro-cluster is represented by a CF Vector: CF = (n, LS, SS)
• Centroid and radius can be calculated from the CF Vector
• CF Additive Theorem allows two CF Vectors to be combined quickly and losslessly
• CF Tree is a hierarchy of CF Vectors
  • A parent CF Vector holds information for all descendant CF Vectors
  • A leaf CF Vector corresponds to a set of actual points
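A small sketch of a CF Vector, assuming n is the point count, LS the per-dimension linear sum and SS the scalar sum of squared norms, with the radius taken as the root mean squared distance of member points to the centroid. The class and its method names are illustrative, not the lecture's code.

```python
import math
from typing import List, Sequence


class CF:
    """Clustering feature (n, LS, SS): count, linear sum, sum of squared norms."""

    def __init__(self, n: int, ls: List[float], ss: float) -> None:
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def from_point(cls, p: Sequence[float]) -> "CF":
        return cls(1, list(p), sum(v * v for v in p))

    def __add__(self, other: "CF") -> "CF":
        # CF additivity: merging two clusters adds the components.
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def radius(self) -> float:
        # Root mean squared distance from member points to the centroid.
        c2 = sum(v * v for v in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))


merged = CF.from_point((0.0, 0.0)) + CF.from_point((2.0, 0.0))
print(merged.n, merged.centroid(), merged.radius())
```

Because merging only adds three quantities, a parent entry in the CF Tree can summarize all of its descendants without revisiting the raw points.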

More on Feature Clustering
• Build CF Tree from raw data, much like a B-tree
• Two parameters in clustering
  • B: branching factor of CF Tree
  • T: radius threshold of CF Vector
• Parameters control how fine micro-clusters are constructed
• Hierarchical agglomerative clustering on leaves of CF Tree
• Entire process is efficient: O(N)

Extracted Feature Space
• Leaf nodes in final clustering become the new features
• More general than the original naïve feature space
• Dimensionality could still be moderately high
• Use Support Vector Machine for classification

Experiments
• Synthetic Data
  • Generated at motif-expression level
  • Abnormal paths are injected with abnormal motif-expressions
• Classifiers
  • SVM using naïve feature space
  • SVM using extracted feature spaces of varying refinement levels

Experiment
[Figure not recovered]

Experiment (2)
[Figure not recovered]

Summary: Moving Object Anomaly Detection
• Higher level semantic analysis of moving objects yields better results
• Automated feature extraction
• Future work
  • Automatic determination of t parameter
  • Better use of feature space hierarchy
  • Other analysis, such as clustering and local outlier detection for anomaly detection
  • Mining other knowledge for moving objects

Summary (1)
• What's Special About Spatial Data Mining?
  • Input Data
  • Statistical Foundation
  • Output Patterns
  • Computational Process

Summary (2)
[Slide content not recovered]

References (1)
• J. Roddick, K. Hornsby and M. Spiliopoulou, Yet Another Bibliography of Temporal, Spatial Spatio-temporal Data Mining Research, KDD Workshop, 2001
• S. Shekhar, C. T. Lu, and P. Zhang, A Unified Approach to Detecting Spatial Outliers, GeoInformatica, 7(2), Kluwer Academic Publishers, 2003
• S. Shekhar and S. Chawla, Spatial Databases: A Tour, Prentice Hall, 2003
• S. Shekhar, P. Schrater, R. Vatsavai, W. Wu, and S. Chawla, Spatial Contextual Classification and Prediction Models for Mining Geospatial Data, IEEE Transactions on Multimedia (special issue on Multimedia Databases), 2002

References (2)
• S. Shekhar and Y. Huang, Discovering Spatial Co-location Patterns: A Summary of Results, SSTD, 2001
• A. Fotheringham, C. Brunsdon, and M. Charlton, Geographically Weighted Regression: The Analysis of Spatially Varying Relationships, John Wiley & Sons, 2002
• P. Tan, M. Steinbach, V. Kumar, C. Potter, S. Klooster, and A. Torregrosa, Finding Spatio-Temporal Patterns in Earth Science Data, KDD Workshop on Temporal Data Mining, 2001
• P. Zhang, Y. Huang, S. Shekhar, and V. Kumar, Exploiting Spatial Autocorrelation to Efficiently Process Correlation-Based Similarity Queries, SSTD, 2003