Visual Data Mining for Quantized, Spatial Data
Amy Braverman
Jet Propulsion Laboratory
California Institute of Technology
Mail Stop 169-237
4800 Oak Grove Drive
Pasadena, CA 91109-8099
Outline
1. Motivation.
2. Approach.
3. AIRS data collection.
4. Quantization.
5. Visual data mining (I).
6. Visual data mining (II).
7. Hierarchical Quantization.
8. Visual data mining (III).
9. Summary.
Motivation
• Earth Observing System satellites return “massive” data volumes.
• Traditional approach to data exploration: produce maps of one-degree averages and standard deviations for each parameter of interest.
• Good news: this is easy, practical, and everybody understands it.
• Bad news: the method throws away almost all of the distributional information in the data, including covariance and higher-order statistics.
• Need: to “mine” the data, i.e., to ask how characteristics of joint distributions change in time and space and across resolutions, and to characterize forcings and feedbacks.
Approach
• New approach: produce an estimate of the joint (empirical) probability distribution of the variables of interest within each one-degree grid cell.
• Use a clustering algorithm such as K-means to partition the data into groups, and represent each group by its centroid and (normalized) membership count.
• The collection of all 180 x 360 = 64,800 grid cell distribution estimates is a proxy for the original data.
• How to find relationships? We need to visualize multivariate relationships while maintaining spatial context.
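The approach above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the production algorithm: the function names `cell_index` and `kmeans_summary`, the choice of five clusters, and the synthetic 11-channel data are all hypothetical, and plain Lloyd's K-means stands in for whatever clustering variant is actually used.

```python
import numpy as np

N_ROWS, N_COLS = 180, 360  # one-degree grid: 180 x 360 = 64,800 cells

def cell_index(lat, lon):
    """Map latitude/longitude in degrees to (row, col) on the one-degree grid."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    row = np.clip(np.floor(lat + 90.0).astype(int), 0, N_ROWS - 1)
    col = np.floor(lon + 180.0).astype(int) % N_COLS
    return row, col

def kmeans_summary(x, k, n_iter=50, seed=0):
    """Lloyd's K-means on one cell's data (rows = observations).

    Returns the centroids of the non-empty clusters and their
    normalized membership counts.
    """
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every observation to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    counts = np.bincount(labels, minlength=k)
    keep = counts > 0  # drop clusters that ended up empty
    return centroids[keep], counts[keep] / counts.sum()

# Synthetic stand-in for the observations falling in one grid cell.
rng = np.random.default_rng(1)
obs = rng.normal(size=(200, 11))
reps, weights = kmeans_summary(obs, k=5)
```

Each grid cell is then stored as the pair `(reps, weights)`, which is far smaller than the raw observations but still describes the shape of the joint distribution.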
AIRS Data Collection
[Figure: AIRS granules overlaid on a 1 degree lat/lon grid. Labels: 135 footprints, 90 footprints, 2250 km, 1500 km.]
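Quantization
[Figure: observations x_1, x_2, ..., x_N in geographic space, each carrying weight 1, are mapped to representatives y_1, y_2, ..., y_K with counts N_1, N_2, ..., N_K in the high-dimensional data space.]
$N_k = \sum_{n=1}^{N} 1[x_n \in k]$
$y_k = \frac{1}{N_k} \sum_{n=1}^{N} x_n \, 1[x_n \in k]$
$Y = E(X \mid Y)$ (!)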
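As a small worked example of the count and centroid formulas on this slide, the sketch below evaluates them literally with an indicator matrix. The function name `quantize` and the four toy observations are assumptions made for illustration, and the code assumes every cluster has at least one member.

```python
import numpy as np

def quantize(x, labels, k):
    """Counts N_k and representatives y_k for a given cluster assignment.

    N_k = sum_n 1[x_n in k]
    y_k = (1 / N_k) * sum_n x_n 1[x_n in k]
    Assumes every cluster 0..k-1 has at least one member.
    """
    # indicator[n, c] is 1 when observation n belongs to cluster c.
    indicator = (labels[:, None] == np.arange(k)[None, :]).astype(float)
    counts = indicator.sum(axis=0)
    reps = (indicator.T @ x) / counts[:, None]
    return counts, reps

x = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 14.0]])
labels = np.array([0, 0, 1, 1])
counts, reps = quantize(x, labels, k=2)
# counts is [2, 2]; reps is [[1, 0], [11, 12]]

# Because each representative is the mean of its members, the
# count-weighted mean of the representatives equals the data mean.
weights = counts / counts.sum()
```

The last comment is the sample version of Y = E(X | Y): averaging the representatives with their weights recovers the mean of the original observations, here (6, 6).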
Visual Data Mining (I)
• Data: 11 AIRS channels observed over 3 days (July 20-22, 2002).
• Compare joint distributions among grid cells:
• Are the grid cell data homogeneous or heterogeneous?
• What physical processes account for the shapes of the representatives and the distribution?
• What physical processes might account for differences between grid cells?
• Are there “outliers”?
• Data in this region: 10,498 clusters representing 60,681 observations.
• Can we summarize the whole region as one?
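The slides do not say how grid cells are compared numerically. One possible first step, offered here purely as an assumption, is to recover low-order moments from each cell's representatives and weights and compare those. The helper name `summary_moments` is hypothetical.

```python
import numpy as np

def summary_moments(reps, weights):
    """Mean and between-cluster covariance implied by a quantized summary.

    reps:    (K, d) array of cluster representatives
    weights: (K,) array of normalized membership counts (sums to 1)
    """
    mean = weights @ reps
    centered = reps - mean
    cov = (centered * weights[:, None]).T @ centered
    return mean, cov

reps = np.array([[1.0, 0.0], [11.0, 12.0]])
weights = np.array([0.5, 0.5])
mean, cov = summary_moments(reps, weights)
```

Note that this covariance contains only the between-cluster part of the scatter. The within-cluster part is lost in quantization, so the result understates the covariance of the raw observations, and the gap shrinks as the number of clusters grows.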
Visual Data Mining (II)
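Hierarchical Quantization
[Figure: four adjacent 1 degree cells with data X_1, X_2, X_3, X_4 are quantized to Y_1, Y_2, Y_3, Y_4; these are pooled into Y, which is quantized again to W.]
$Y_j = E(X_j \mid Y_j)$
$X = \sum_{j=1}^{4} X_j \, I(V = j)$
$Y = \sum_{j=1}^{4} Y_j \, I(V = j)$
$P(V = j) = N_j / \sum_{i=1}^{4} N_i$
$W = E(Y \mid W)$
$\delta(X_j, Y_j) = E\|X_j - Y_j\|^2$
$\delta(X, Y) = E\|X - Y\|^2$
$\delta(Y, W) = E\|Y - W\|^2$
$\delta(X, W) = \delta(Y, W) + \sum_{j=1}^{4} \delta(X_j, Y_j) \, P(V = j)$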
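The distortion decomposition on this slide can be checked numerically. The sketch below uses synthetic data for four cells and deliberately crude partitions (the identity holds for any partition as long as each representative is the mean of its members). All names, cluster counts, and data here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 3, 2  # clusters per cell, clusters at the second level

def centroids(x, labels, k):
    """Mean of the rows of x within each label 0..k-1 (all non-empty)."""
    return np.stack([x[labels == c].mean(axis=0) for c in range(k)])

# Synthetic observations X_1..X_4 for four one-degree cells.
cells = [rng.normal(loc=j, size=(50 + 10 * j, 2)) for j in range(4)]
X = np.concatenate(cells)
cell_id = np.concatenate([np.full(len(x), j) for j, x in enumerate(cells)])

# First level: split each cell into K groups by rank of the first coordinate.
rep_id = np.concatenate([
    j * K + np.argsort(np.argsort(x[:, 0])) * K // len(x)
    for j, x in enumerate(cells)
])
Y = centroids(X, rep_id, 4 * K)[rep_id]      # Y_j = E(X_j | Y_j)

# Second level: assign each of the 4K representatives to one of M groups.
super_id = (np.arange(4 * K) % M)[rep_id]
W = centroids(Y, super_id, M)[super_id]      # W = E(Y | W)

def delta(a, b):
    """Mean squared Euclidean distance between paired rows."""
    return ((a - b) ** 2).sum(axis=1).mean()

p_v = np.bincount(cell_id) / len(X)          # P(V = j) = N_j / sum_i N_i
within = sum(p_v[j] * delta(X[cell_id == j], Y[cell_id == j]) for j in range(4))

lhs = delta(X, W)
rhs = delta(Y, W) + within
```

Here `lhs` and `rhs` agree to floating-point precision. The cross term vanishes because, within each first-level cluster, the deviations of the observations from their representative sum to zero.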
Visual Data Mining (III)
• Where do the clusters come from?
More questions:
• How do the distributions change as you move from east to west? (Suggested approach: subdivide the region into a western half and an eastern half. Summarize each separately and compare them to each other and to the summary of the whole. Subdivide again, etc.) North to south?
• What other regions are similar to this one? Are they the ones we expect based on physics? Does spatial resolution matter for answering the question? If so, how?
• Where are the regions of high complexity (variability or distribution entropy)? Does the physics support this?
• How does the regression of channel 1 on channel 2 change spatially?
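Two of the questions above, distribution entropy and the regression of channel 1 on channel 2, can be computed directly from a cell's quantized summary. The sketch below is one assumed way to do so; the function names and the toy summary are hypothetical, and the regression uses only the representatives and their weights, so it ignores within-cluster scatter.

```python
import numpy as np

def distribution_entropy(weights):
    """Shannon entropy (in nats) of the normalized membership counts."""
    w = np.asarray(weights, dtype=float)
    w = w[w > 0]  # 0 * log(0) is taken as 0
    return float(-(w * np.log(w)).sum())

def weighted_regression(reps, weights, response=0, predictor=1):
    """Weighted least-squares line of one channel on another.

    Returns (slope, intercept) for response = slope * predictor + intercept,
    fitted to the representatives with their weights.
    """
    y = reps[:, response]
    x = reps[:, predictor]
    x_bar = weights @ x
    y_bar = weights @ y
    slope = (weights @ ((x - x_bar) * (y - y_bar))) / (weights @ (x - x_bar) ** 2)
    return float(slope), float(y_bar - slope * x_bar)

# Toy summary in which channel 1 (column 0) equals 2 * channel 2 + 1 exactly.
reps = np.array([[1.0, 0.0], [3.0, 1.0], [5.0, 2.0]])
weights = np.array([0.2, 0.3, 0.5])
slope, intercept = weighted_regression(reps, weights)
entropy = distribution_entropy(weights)
```

Computing `entropy` or `slope` for every grid cell yields one number per cell, which can then be drawn as a map to keep the spatial context.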
Summary
• Accept coarser spatial resolution (one degree) to achieve replication and estimate distributions.
• Explore quantized data interactively by comparing distributions at different levels of aggregation and in different locations (and times).
• We are mining the data, not making inferences. No spatial statistical models.
• AIRS data will be available at http://daac.gsfc.nasa.gov/atmodyn/airs/index.html.
• More information about AIRS: http://www-airs.jpl.nasa.gov.