Data Projections & Visualization Rajmonda Caceres MIT Lincoln Laboratory.

Page 1:

Data Projections & Visualization

Rajmonda Caceres, MIT Lincoln Laboratory

Page 2:

Dimensionality Reduction

Reduce complexity: visual, computational

Identify the intrinsic dimensionality of data

Identify the most relevant aspects of data given a task

Page 3:

The Curse of Dimensionality

[Figure: panels labeled Lower Dimension and Higher Dimension]
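
One symptom often associated with the curse of dimensionality is that the nearest and farthest neighbors of a point end up almost equally far away as the dimension grows. The following is a minimal sketch of that effect, assuming points drawn uniformly from the unit cube (an illustrative choice, not taken from the slides); `contrast` is a hypothetical helper name.

```python
import math
import random

def contrast(dim, n_points=200, seed=0):
    """Relative gap between the farthest and nearest point from a random query.

    Assumption: all points are drawn uniformly from the unit cube [0, 1]^dim.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest

# The relative gap shrinks as the dimension increases.
for dim in (2, 10, 100, 1000):
    print(dim, round(contrast(dim), 3))
```

A small value means that "nearest" and "farthest" are hard to tell apart, which is one reason more features are not automatically better for distance-based methods.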

Page 4:

Data Projections

[Figure: panels a) and b)]

Not all projections are equal

Page 5:

Data Projections

Desired properties:

Reduced, compressed representation

Preserved useful/intrinsic properties of the data

Amplified patterns of interest (e.g. outliers)

Simple, interpretable

Trade-off between simplicity and preservation of structure

Page 6:

Distance Function

Helps us organize the data

Helps us discriminate patterns

Page 7:

Distance Functions

Manhattan distance (1 norm, taxicab distance)

Euclidean distance (2 norm)
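
As a concrete illustration of the two distances named above, here is a minimal sketch using only the standard library. The function names `manhattan` and `euclidean` and the sample points are assumptions made for this example.

```python
import math

def manhattan(x, y):
    """1-norm (taxicab) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """2-norm distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

p, q = (0.0, 0.0), (3.0, 4.0)
print(manhattan(p, q))  # 7.0
print(euclidean(p, q))  # 5.0
```

The same pair of points is 7 units apart along the grid but only 5 units apart in a straight line, so the choice of distance changes which points count as close.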

Page 8:

Distance Functions

L-p Distance

As p grows, the largest coordinate difference tends to dominate the overall distance
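
The effect of p can be checked numerically. This sketch assumes the usual L-p (Minkowski) form, the p-th root of the sum of absolute coordinate differences raised to the power p; `lp_distance` and the sample vectors are hypothetical names and values chosen for illustration.

```python
def lp_distance(x, y, p):
    """L-p (Minkowski) distance between two equal-length sequences, for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0.0, 0.0, 0.0), (1.0, 2.0, 5.0)
largest_gap = max(abs(a - b) for a, b in zip(x, y))  # 5.0, the largest coordinate difference

# As p increases, the distance moves from the sum of gaps toward the largest gap.
for p in (1, 2, 4, 16, 64):
    print(p, lp_distance(x, y, p))
```

With p = 1 every coordinate contributes equally, while for large p the result is essentially the single largest difference.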

Page 9:

Distance Functions

Page 10:

Data Projections

Projective methods: preserve a property of the data

Principal Component Analysis (PCA)

Many others: ICA, Factor Analysis

Manifold learning

Multidimensional Scaling (MDS), LLE, Isomap

Page 11:

Principal Component Analysis

Goal: Find a linear projection that captures most of the variance

[Figure: data with the 1st and 2nd principal components marked]

Page 12:

Principal Component Analysis

PCA pseudocode:

1. Center the data by subtracting the mean

2. Calculate the covariance matrix

3. Calculate the eigenvectors (principal components) of the covariance matrix

4. Select the top few (2-3) eigenvectors (those with the highest eigenvalues)

5. Project the data using these eigenvectors as axes
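
The PCA steps above can be sketched with NumPy as follows. This is one possible reading of the pseudocode, not the author's implementation: it assumes rows are samples and columns are features, and it uses `np.cov`, which normalizes by n - 1. The function name `pca` and the synthetic data are hypothetical.

```python
import numpy as np

def pca(data, n_components=2):
    """Project rows of `data` onto the top principal components.

    Returns (projected data, components as columns, eigenvalues in descending order).
    """
    X = np.asarray(data, dtype=float)
    centered = X - X.mean(axis=0)             # subtract the per-feature mean
    cov = np.cov(centered, rowvar=False)      # covariance matrix (n - 1 normalization)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: cov is symmetric
    order = np.argsort(eigvals)[::-1]         # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]    # keep the top few eigenvectors
    projected = centered @ components         # coordinates along the new axes
    return projected, components, eigvals

rng = np.random.default_rng(0)
# Hypothetical data: 3 features, with most of the variance along the first one.
data = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])
projected, components, eigvals = pca(data, n_components=2)
print(projected.shape)  # (200, 2)
print(eigvals)
```

The variance of the data along each projected axis equals the corresponding eigenvalue, which is why sorting by eigenvalue selects the directions of greatest variance.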

Page 13:

PCA on IRIS Dataset

[Figure: scree plot and biplot]
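
A scree plot shows how much variance each principal component carries. The sketch below computes those fractions with NumPy on synthetic data; the data has the same shape as Iris (150 samples, 4 features) but is NOT the Iris measurements, and `explained_variance_ratio` is a hypothetical helper name.

```python
import numpy as np

def explained_variance_ratio(data):
    """Fraction of total variance carried by each principal component, largest first."""
    X = np.asarray(data, dtype=float)
    cov = np.cov(X - X.mean(axis=0), rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    return eigvals / eigvals.sum()

rng = np.random.default_rng(1)
# Hypothetical 4-feature data with unequal spread per feature.
data = rng.normal(size=(150, 4)) * np.array([4.0, 2.0, 1.0, 0.5])
ratios = explained_variance_ratio(data)
for i, r in enumerate(ratios, start=1):
    print(f"PC{i}: {r:.3f}")
```

Plotting these values against the component index gives the scree plot; a sharp drop after the first few components suggests that a low-dimensional projection keeps most of the variance.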

Page 14:

Multidimensional Scaling

Goal: Find a lower-dimensional embedding of the data that preserves pairwise distances

Formally:

: Input distance values

: Output distance values
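
One way to compute such an embedding is classical (Torgerson) MDS, which double-centers the squared input distances and keeps the top eigenvectors. The sketch below is one MDS variant among several and is not necessarily the formulation used on the slide; `classical_mds`, `pairwise_distances`, and the random points are hypothetical.

```python
import numpy as np

def classical_mds(D, n_components=2):
    """Embed points so that Euclidean distances approximate the distance matrix D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # Gram matrix of the centered points
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))  # guard against tiny negatives
    return eigvecs[:, order] * scale

def pairwise_distances(X):
    """Matrix of Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
points = rng.normal(size=(10, 2))                 # hypothetical 2-D locations
D_in = pairwise_distances(points)                 # input distance values
embedding = classical_mds(D_in, n_components=2)
D_out = pairwise_distances(embedding)             # output distance values
print(np.abs(D_in - D_out).max())                 # near zero for exactly Euclidean input
```

The recovered coordinates match the original points only up to rotation, reflection, and translation, since those transformations leave pairwise distances unchanged.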

Page 15:

MDS Projection of US Capitals

Page 16:

Goodness of MDS Solution

Shepard Diagram

[Figure: Shepard diagram with axes MDS Distances and Data Distances]
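
A Shepard diagram plots, for every pair of points, the distance in the original data against the distance in the embedding. The sketch below builds those pairs and a single summary number. The normalization used in `normalized_stress` is one common choice and is an assumption here, as are the function names, the sample points, and the crude embedding that simply drops a coordinate.

```python
import math
from itertools import combinations

def shepard_pairs(original, embedded):
    """(data distance, embedding distance) for every pair of points."""
    return [
        (math.dist(original[i], original[j]), math.dist(embedded[i], embedded[j]))
        for i, j in combinations(range(len(original)), 2)
    ]

def normalized_stress(pairs):
    """sqrt(sum (d_data - d_emb)^2 / sum d_data^2); 0 means distances are preserved exactly."""
    num = sum((d - e) ** 2 for d, e in pairs)
    den = sum(d ** 2 for d, _ in pairs)
    return math.sqrt(num / den)

# Hypothetical 3-D points and a crude 2-D embedding that drops the last coordinate.
original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.1), (0.0, 2.0, 0.0), (3.0, 1.0, 2.0)]
embedded = [p[:2] for p in original]
pairs = shepard_pairs(original, embedded)
print(len(pairs))  # 6 pairs for 4 points
print(normalized_stress(pairs))
print(normalized_stress(shepard_pairs(original, original)))  # 0.0
```

In a good solution the pairs fall close to the diagonal of the diagram; points far from it mark distances that the embedding distorts.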

Page 17:

Takeaways

More features are not necessarily better

Understand the assumptions of different modeling choices

When choosing distance functions and projection methods:

Consider the characteristics of the data

Consider the learning objective

Explore multiple choices simultaneously to gain better insight

Page 18:

References

http://statweb.stanford.edu/~jtaylo/courses/stats202/mds.html

https://planspacedotorg.wordpress.com/2013/02/03/pca-3d-visualization-and-clustering-in-r/

Multidimensional Scaling, Leland Wilkinson

Dimension Reduction: A Guided Tour, Christopher J.C. Burges

When is “nearest neighbor” meaningful?, Beyer, K.S., Goldstein, J., Ramakrishnan, R. & Shaft, U.