practical of data preprocessing

Spatial Temporal Data Mining. Wei Wang, Data Mining Lab, UCLA. January 21, 1999

description

it is about weka tool

Transcript of practical of data preprocessing

  • Spatial Temporal Data Mining

    Wei Wang
    Data Mining Lab, UCLA
    January 21, 1999

  • Outline

    Introduction
    Statistical Clustering
    User-defined Trigger
    Spatial Index Structure for High Dimensional Point Data
    Temporal Spatial Pattern Detection (ongoing research)

  • Introduction

    Spatial data mining has been an active research area in recent years.

    For some well-known problems, e.g., clustering, many existing algorithms are not efficient enough. There is still room for improvement.

    Many interesting problems remain uninvestigated.

    We classify a subset of these problems and try to solve them efficiently.

  • Outline

    STING: a statistical information grid approach to spatial data mining
    STING+: an approach to active spatial data mining
    PK-tree: a spatial index structure for high dimensional point data
    Temporal spatial pattern detection

  • STING

    Spatial databases are usually huge, so the efficiency of the data mining algorithm is crucial.

    Example: each person is an object.
    Query: find high-income areas within California, where
      high income means salary > $50,000
      area > 4 square miles

    Traditional method:
      Step 1: Select all persons whose salary is high.
      Step 2: Run clustering analysis on the selected persons.
      Step 3: Form the region that each cluster occupies.
      Step 4: Return the regions larger than 4 square miles.

    If high income is instead defined as "80% of persons have salary > $50,000", the traditional method cannot answer the query at all.

    STING was proposed to solve such problems efficiently.

  • STING

    Region query example: select the maximal regions that have at least 100 houses per unit area, in which at least 70% of the house prices are above $400K, and whose total area is at least 100 units, with 90% confidence.

      SELECT REGION
      FROM house-map
      WHERE DENSITY IN (100, ∞)
      AND price RANGE (400000, ∞) WITH PERCENT (0.7, 1)
      AND AREA (100, ∞)
      AND WITH CONFIDENCE 0.9

  • STING

    Objects are represented by points, each of which has associated spatial attributes (its location) and non-spatial (numerical) attributes.

    Space is recursively divided into smaller rectangular cells until a certain level is reached. A hierarchical structure is employed.

    The average number of objects in a leaf cell ranges from several dozen to several thousand.

    Preprocess the data: capture the statistical information.

  • STING

    [Figure: the hierarchical cell structure, from the 1st layer (root) through the (i-1)th, ith, and (i+1)th layers down to the leaf layer]

  • STING

    For each cell, we have:
      an attribute-independent parameter
        n: number of objects
      attribute-dependent parameters (for each numerical attribute)
        mean: mean value of the attribute
        std: standard deviation of the attribute value
        min: the minimum value of the attribute
        max: the maximum value of the attribute
        distribution: the type of distribution that best fits the attribute value (can be NONE)

    The parameters are generated bottom-up when the data is loaded into the database.

    Compilation time is linear.

    This only has to be done once, not for each query.
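
The bottom-up generation can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `CellStats`, `leaf_stats`, and `merge_stats` are hypothetical, the salary values are made up, the standard deviation is the population one, and the `distribution` parameter is left out. The point it shows is that a parent's parameters follow from its children's parameters alone, so the raw data is read only once.

```python
import math
from dataclasses import dataclass


@dataclass
class CellStats:
    n: int           # number of objects in the cell
    mean: float      # mean of the attribute
    std: float       # population standard deviation of the attribute
    minimum: float   # smallest attribute value in the cell
    maximum: float   # largest attribute value in the cell


def leaf_stats(values):
    """Compute the parameters of a non-empty leaf cell from raw values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return CellStats(n, mean, math.sqrt(var), min(values), max(values))


def merge_stats(children):
    """Derive a parent's parameters from its non-empty children only."""
    n = sum(c.n for c in children)
    mean = sum(c.n * c.mean for c in children) / n
    # E[x^2] of the parent is the n-weighted average of each child's E[x^2].
    second_moment = sum(c.n * (c.std ** 2 + c.mean ** 2) for c in children) / n
    var = max(second_moment - mean ** 2, 0.0)  # clamp rounding error
    return CellStats(
        n,
        mean,
        math.sqrt(var),
        min(c.minimum for c in children),
        max(c.maximum for c in children),
    )


# Made-up salaries in two sibling leaf cells.
left = leaf_stats([40_000, 60_000])
right = leaf_stats([50_000, 70_000, 90_000])
parent = merge_stats([left, right])
direct = leaf_stats([40_000, 60_000, 50_000, 70_000, 90_000])
```

Here `parent` (merged from the two leaves) and `direct` (computed from all five values at once) agree up to floating-point rounding.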

  • STING

    Take advantage of the statistical information captured.

    Only go through the relevant cells at each level:
      The root is relevant.
      For each relevant cell, we examine its children at the next level with a statistical test and label them as relevant or not relevant.
      Form regions from the relevant leaf-level cells.

    There is no need to access the full database, which makes the approach very efficient.
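
A rough sketch of this top-down pass is given below. It is an illustration under stated assumptions: the `Cell` class, the cell names, and the numbers are hypothetical, and a plain bound check on the cell's maximum stands in for the statistical test mentioned on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    name: str
    n: int              # number of objects in the cell
    maximum: float      # largest attribute value in the cell
    children: list = field(default_factory=list)


def is_relevant(cell, threshold):
    # Assumption: a simple bound check replaces STING's statistical test.
    return cell.n > 0 and cell.maximum >= threshold


def relevant_leaves(root, threshold):
    """Walk the hierarchy level by level, expanding only relevant cells."""
    examined = 0
    frontier = [root]   # the root is always treated as relevant
    leaves = []
    while frontier:
        next_level = []
        for cell in frontier:
            if not cell.children:
                leaves.append(cell.name)
                continue
            for child in cell.children:
                examined += 1
                if is_relevant(child, threshold):
                    next_level.append(child)
        frontier = next_level
    return leaves, examined


# Made-up two-level hierarchy of house prices.
root = Cell("root", 23, 500_000, [
    Cell("A", 15, 300_000, [Cell("A1", 10, 300_000), Cell("A2", 5, 200_000)]),
    Cell("B", 8, 500_000, [Cell("B1", 0, 0), Cell("B2", 8, 500_000)]),
])
leaves, examined = relevant_leaves(root, 400_000)
```

In this toy hierarchy only four of the six non-root cells are examined, because the children of the irrelevant cell `A` are never visited.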

  • STING

    The computational complexity of STING is linearly proportional to the number of leaf cells.

    We used the SEQUOIA 2000 benchmark as the data set to compare the performance of STING with other approaches.

  • STING

    STING is a query-independent approach: the statistical information exists independently of queries.

    STING has a much smaller response time than other approaches:
      The computational complexity is linearly proportional to the number of leaves.
      I/O cost is low.

    STING can support different resolutions of the query result.

    The regions returned by STING approach those returned by DBSCAN as the granularity approaches zero.

    The parameters in the hierarchical structure can be maintained efficiently by incremental update.

  • STING+

    Moreover, since objects evolve, interesting patterns may emerge or disappear over time.

    Example trigger: do bandwidth reallocation when the average call length is greater than 10 minutes within a region where at least 10 cellular phones are in use per square mile.

    Traditional database triggers cannot support this efficiently, because the class membership of an object is determined not only by its non-spatial attributes but also by the attributes of the objects in its neighborhood.

    Naive approach: re-evaluate the condition periodically. This is not efficient.

  • STING+

    STING+ is an extension of STING that supports user-defined triggers.

    In spatial databases, object insertion, deletion, and update are primitive events.

    Observation: usually, only the cumulative effect of a set of primitive events can cause the trigger condition to become true.

    We refer to such a set of primitive events as a composite event.

  • STING+

    Condition-Action paradigm

    In general, it is difficult or even impossible for the user to specify all possible composite events that may cause the trigger condition to be true.

    Evaluating a user-defined trigger T usually involves two aspects:
      Find a set of composite events E(s) that may cause the trigger condition CT to become true.
      Each time some composite event in E(s) occurs, check the status (false or true) of CT (given that CT was false previously).

  • STING+

    Observation: as a side effect of the occurrence of some composite event, the set of composite events E(s) that could cause CT to transition from false to true might also evolve over time.

    There are two sets of composite events to consider:
      the set of composite events E(s) that can cause CT to become true (need to re-evaluate CT)
      the set of composite events F(s) that can cause a change to E(s) (need to update E(s))

  • STING+

    Observation: in spatial databases, the effect of an event is usually local to its neighborhood.

    STING+ decomposes the user-defined trigger into a set of sub-triggers associated with individual cells in the hierarchical structure.

    These sub-triggers are used to monitor composite events in E(s) and F(s), and they change accordingly when E(s) and F(s) evolve.

  • STING+

    Updates are suspended at some level in the hierarchy until such time that the cumulative effect of these updates might cause the trigger condition to become satisfied.

    [Figure: sub-triggers on cells at levels 1 to 3 of the hierarchy]
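
The idea of absorbing primitive events locally can be sketched for a single cell as follows. This is a simplified illustration, not STING+ itself: the `SubTrigger` class is hypothetical, it tracks only a density threshold, and it ignores the hierarchy above the cell.

```python
class SubTrigger:
    """Counts primitive events in one cell and reports upward only when
    the accumulated change flips the cell's density status."""

    def __init__(self, count, area, min_density):
        self.count = count              # objects currently in the cell
        self.area = area                # cell area, in square miles
        self.min_density = min_density  # objects per square mile

    def dense(self):
        return self.count / self.area >= self.min_density

    def apply(self, delta):
        """Apply an insertion (+1) or a deletion (-1).

        Returns True only when the dense/not-dense status changed, which
        is the moment the trigger condition needs to be re-evaluated.
        """
        before = self.dense()
        self.count += delta
        return self.dense() != before


# Made-up event stream for a 1-square-mile cell holding 7 phones.
cell = SubTrigger(count=7, area=1.0, min_density=10)
events = [+1, +1, -1, +1, +1, +1]
flags = [cell.apply(delta) for delta in events]
re_evaluations = sum(flags)
```

Of the six primitive events, only the one that brings the count to 10 is reported, so the condition is re-evaluated once rather than six times.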

  • STING+

    Example: trigger bandwidth reallocation when the total area occupied by qualifying regions in California increases by at least 10 square miles. A qualifying region has at least 10 cellular phones in use per square mile, an average phone call length of at least 15 minutes, and a total area of at least 50 square miles.

      DEFINE TRIGGER example
      ON cellular-phone
      WHEN SELECT SIZE(REGION) INCREASE RANGE (10, ∞)
      WHERE DENSITY IN RANGE (10, ∞)
      AND AVERAGE(length) IN RANGE (15, ∞)
      AND AREA IN RANGE (50, ∞)
      LOCATION California
      DO bandwidth-reallocation

  • STING+

    Observation: the trigger condition CT is a conjunction of predicates P1 ∧ P2 ∧ ... ∧ Pn and cannot be true if any one predicate is false.

    The predicates can be evaluated in a certain order: the ith predicate is tested only when all previous i-1 predicates are true. The evaluation order should be chosen so that the total cost is minimized.

    STING+ evaluates CT in the order {location, density condition, attribute condition}, each of which is evaluated in a different phase.

    Location only needs to be evaluated once, and its cost can be regarded as constant in the trigger evaluation process. If the location is fixed, unnecessary sub-triggers on cells outside the location can be avoided, which saves the evaluation cost of the other predicates.

    Sub-triggers set during an earlier phase exist longer than those set in a later phase, so it is better to evaluate first the predicate that takes less time to handle:

      cost(density) < cost(attribute)
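
The ordered, short-circuit evaluation can be illustrated with a small sketch. The helper `evaluate_conjunction`, the unit costs, and the three cells are all hypothetical; only the relation cost(density) < cost(attribute) is taken from the slide.

```python
def evaluate_conjunction(predicates, cell):
    """Evaluate (name, cost, test) predicates in order and stop at the
    first one that is false. Returns (result, total cost spent)."""
    spent = 0
    for _name, cost, test in predicates:
        spent += cost
        if not test(cell):
            return False, spent
    return True, spent


# Hypothetical unit costs; only cost(density) < cost(attribute) matters.
density = ("density", 1, lambda c: c["phones_per_sq_mile"] >= 10)
attribute = ("attribute", 3, lambda c: c["avg_call_minutes"] >= 15)

# Made-up cells.
cells = [
    {"phones_per_sq_mile": 2, "avg_call_minutes": 20},
    {"phones_per_sq_mile": 12, "avg_call_minutes": 5},
    {"phones_per_sq_mile": 15, "avg_call_minutes": 18},
]

results = [evaluate_conjunction([density, attribute], c)[0] for c in cells]
cheap_first = sum(evaluate_conjunction([density, attribute], c)[1] for c in cells)
costly_first = sum(evaluate_conjunction([attribute, density], c)[1] for c in cells)
```

Both orders give the same truth values, but on this toy data the cheap-first order spends fewer cost units in total. In general the best order also depends on how often each predicate fails, which this sketch does not model.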

  • STING+

    Average CPU cycles for handling each type of sub-trigger

                          Density condition      Attribute condition    Movement
                          insertion   deletion   inside     outside     expand   shrink
    Intermediate level    3812        3803       3789       3807        N/A      N/A
    Leaf level            8055        5775       11212      8164        2126     2087

  • PK-tree

    As both the number of objects and the number of attributes are very large, it is essential to organize the set of objects with some dynamic indexing structure.

    Point index methods

    Index Method    Overlapping Siblings   Height      Bounded Node Size   Bounded Storage
    PR Quad-tree    No                     Unbounded   Yes                 No
    K-D-B-tree      No                     Unbounded   Yes                 No
    SR-tree         Yes                    log(N)      Yes                 Yes
    X-tree          Yes                    log(N)      No                  Yes
    PK-tree         No                     log(N)      Yes                 Yes

  • PK-tree

    Spatial decomposition: space is recursively divided until a level LD such that each cell contains at most one point.

  • PK-tree

    [Figure: 16 intermediate nodes, height = 3]

  • PK-tree

    [Figure: 5 intermediate nodes, height = 2]

  • PK-tree

    The PK-tree employs the concept of a K-instantiable cell to eliminate unnecessary nodes.

    Point cell: a non-empty cell at level LD.

    A cell C is K-instantiable iff
      C is a point cell, or
      there do not exist K-1 or fewer K-instantiable sub-cells that cover all non-empty space in C.

    Only K-instantiable cells serve as nodes in the PK-tree (except the root).

    The parent-child relationship follows naturally from the cell-subcell relationship.
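
The definition can be turned into a small recursive construction. The sketch below is an illustration under simplifying assumptions, not the published PK-tree algorithm: it works in two dimensions on an integer grid whose side is a power of two, it takes unit cells as level LD, it builds the tree in one batch rather than dynamically, and the names `Node`, `instantiable_cells`, and `build_pk_tree` are hypothetical. It relies on the observation that the outermost K-instantiable sub-cells of a cell form its smallest cover, so the cell is K-instantiable exactly when there are at least K of them.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    x0: int
    y0: int
    size: int                      # side length of the square cell
    children: list = field(default_factory=list)

    @property
    def is_point_cell(self):
        return self.size == 1


def instantiable_cells(points, x0, y0, size, k):
    """Return the outermost K-instantiable cells inside (or equal to) the
    cell with corner (x0, y0) and the given side length."""
    inside = [(x, y) for x, y in points
              if x0 <= x < x0 + size and y0 <= y < y0 + size]
    if not inside:
        return []
    if size == 1:                  # non-empty cell at the deepest level
        return [Node(x0, y0, 1)]
    half = size // 2
    covering = []
    for dx in (0, half):
        for dy in (0, half):
            covering += instantiable_cells(inside, x0 + dx, y0 + dy, half, k)
    if len(covering) >= k:         # K-1 or fewer sub-cells cannot cover it
        return [Node(x0, y0, size, covering)]
    return covering                # not instantiable: hand sub-cells upward


def build_pk_tree(points, size, k):
    """The root is always a node, whether or not it is K-instantiable."""
    root = Node(0, 0, size)
    half = size // 2
    for dx in (0, half):
        for dy in (0, half):
            root.children += instantiable_cells(points, dx, dy, half, k)
    return root


# Three made-up points on an 8 x 8 grid.
pts = [(0, 0), (1, 1), (7, 7)]
tree_k3 = build_pk_tree(pts, 8, 3)
tree_k2 = build_pk_tree(pts, 8, 2)
```

With K = 3 the root holds the three point cells directly. With K = 2 the 2 x 2 cell containing (0, 0) and (1, 1) becomes an intermediate node, while the 4 x 4 cell around it is skipped because a single sub-cell covers all of its non-empty space.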

  • PK-tree

    Properties:
      Bounds on node outdegree: allows allocating one node to a page.
      Bounded storage space.
      Existence and uniqueness: makes the behavior of a PK-tree easier to analyze.
      Expected height log(N) under some general conditions: guarantees efficient retrieval and update.
      No overlapping among sibling nodes: efficient retrieval.

    Empirical studies show that the PK-tree outperforms the SR-tree and X-tree by a wide margin.

  • PK-tree

    Height of generated trees on 100,000 points

    Dimension       2     4     8     16    32    64
    PK-tree (u)     4     4     5     6     7     9
    PK-tree (c1)    5     7     7     6     7     8
    PK-tree (c2)    7     7     6     7     8     9
    X-tree          4     4     4     4     5     6
    SR-tree         4     4     5     5     6     7

    Size of index in MB

    Dimension   2                 4                 8                 16
                u     c1    c2    u     c1    c2    u     c1    c2    u     c1    c2
    PK-tree     1.8   1.9   1.9   2.8   2.8   2.8   4.9   4.8   4.9   9.4   9.3   9.4
    X-tree      1.8   1.8   1.8   3.0   3.0   3.0   5.6   5.5   5.6   10.7  10.4  10.6
    SR-tree     69    70    70    74    73    74    74    74    75    90    91    92

  • PK-tree

    [Figure: KNN query on clustered data distribution]

  • PK-tree

    Real data set: NASA Sky Telescope Data, 200,000 two-dimensional points (the coordinates of crater locations on the surface of Mars)

                height   size     KNN CPU   KNN I/O   RAN CPU   RAN I/O
    PK-tree     5        3.7MB    4ms       4         3ms       4
    X-tree      4        5.7MB    90ms      4         10ms      4
    SR-tree     5        120MB    28ms      8         14ms      6

  • Temporal Spatial Pattern Detection

    When the number of attributes is large and/or the values of attributes evolve frequently, the complexity of patterns and the number of potential patterns increase.

    It is not desirable, or even feasible, to ask the user to specify the interesting patterns. E.g., the user wants to know any possible patterns involving certain attributes such as salary, rent, cellular phone usage, etc.

    Existing association rule algorithms cannot be applied:
      Continuous attribute domain
      Temporal evolution
      Prior knowledge about relationships among attributes and objects

  • Temporal Spatial Pattern Detection

    Each object is represented by a point with:
      primitive attributes
        spatial attributes, i.e., the coordinates of its position
        non-spatial attributes, e.g., name, weight, height, salary, rent
      derived attributes, derived from primitive attribute(s)
        environment attributes, e.g., distance to a hospital, average income in the neighborhood area

    Consider a sequence of snapshots S1, S2, ..., Sn.

    A temporal spatial pattern describes a possible relationship among the evolution of attributes.

    E.g., if the user wants to know patterns involving salary and distance to the big city, then one interesting pattern would be that people receiving a raise tend to move further away from the big city from 1987 to 1993.
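
To make the salary example concrete, the sketch below compares two snapshots and measures how often a raise coincides with a move away from the city. It is only an illustration of the kind of relationship described, not a detection algorithm from the talk: the city location, the snapshot layout, the function names, and all data values are hypothetical.

```python
import math

CITY = (0.0, 0.0)  # hypothetical location of the big city


def distance_to_city(position):
    """Environment attribute derived from the spatial attributes."""
    return math.dist(position, CITY)


def raise_and_move_away(snapshot_a, snapshot_b):
    """Among objects whose salary rose between the two snapshots, return
    the fraction whose distance to the city also increased."""
    raised = [
        oid for oid in snapshot_a
        if oid in snapshot_b
        and snapshot_b[oid]["salary"] > snapshot_a[oid]["salary"]
    ]
    if not raised:
        return None
    moved_away = [
        oid for oid in raised
        if distance_to_city(snapshot_b[oid]["pos"])
        > distance_to_city(snapshot_a[oid]["pos"])
    ]
    return len(moved_away) / len(raised)


# Two made-up snapshots keyed by object id.
s1 = {
    1: {"salary": 40_000, "pos": (3.0, 4.0)},
    2: {"salary": 50_000, "pos": (1.0, 0.0)},
    3: {"salary": 45_000, "pos": (6.0, 8.0)},
}
s2 = {
    1: {"salary": 48_000, "pos": (6.0, 8.0)},
    2: {"salary": 50_000, "pos": (1.0, 0.0)},
    3: {"salary": 52_000, "pos": (3.0, 4.0)},
}
fraction = raise_and_move_away(s1, s2)
```

On this toy data two people received a raise and one of them moved further out, so the fraction is one half. A real detector would also have to search over attribute pairs and snapshot intervals rather than test one fixed relationship.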

  • Temporal Spatial Pattern Detection

    More complicated patterns:
      Patterns on clustering evolution
      Patterns of high order
      Patterns whose cause and consequence do not happen together: there is a delay before the consequence shows up.
      Patterns involving relationships among objects, e.g., people who live far away from any doctor tend to move to a place closer to some doctor.

    Environment variables evolve independently over time.