Data Projections & Visualization Rajmonda Caceres MIT Lincoln Laboratory.

Post on 17-Jan-2016


Data Projections & Visualization

Rajmonda Caceres, MIT Lincoln Laboratory

Dimensionality Reduction

Reduce complexity: visual, computational

Identify the intrinsic dimensionality of data

Identify the most relevant aspects of data given a task

The Curse of Dimensionality

[Figure: lower dimension vs. higher dimension]
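
One way to see the effect numerically is to check how the gap between the nearest and the farthest point shrinks as the dimension grows. The sketch below is only an illustration under assumed settings: uniform random points in the unit cube, Euclidean distance, and a hypothetical helper named `relative_contrast`. None of these choices come from the slides.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for distances from one random query point.

    Assumption for illustration: points are drawn uniformly from the unit cube.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))  # Euclidean distance
    return (max(dists) - min(dists)) / min(dists)


# The contrast between nearest and farthest point drops as dim grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

A small contrast means the nearest and farthest points are almost equally far away, so distance-based comparisons carry less information in high dimensions.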

Data Projections

Not all projections are equal

Data Projections

Desired properties:

Reduced, compressed representation

Preserves useful/intrinsic properties of the data

Amplifies patterns of interest (e.g. outliers)

Simple, interpretable

Trade-off between simplicity and preservation of structure

Distance Function

Helps us organize the data

Helps us discriminate patterns

Distance Functions

Manhattan distance (1 norm, taxicab distance)

Euclidean distance (2 norm)

L-p Distance

Distance Functions

As p grows, the largest coordinate difference tends to dominate the overall distance
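
The L-p (Minkowski) distance between x and y is (sum_i |x_i - y_i|^p)^(1/p), with p = 1 giving the Manhattan distance and p = 2 the Euclidean distance. The following minimal sketch shows the value approaching the largest coordinate difference as p grows. The function names and the example vectors are hypothetical and chosen only for illustration.

```python
def lp_distance(x, y, p):
    """Minkowski (L-p) distance between two equal-length sequences, for p >= 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def max_coordinate_distance(x, y):
    """Largest absolute coordinate difference (the limit of L-p as p grows)."""
    return max(abs(a - b) for a, b in zip(x, y))


# Example vectors (assumed): coordinate differences are 3, 4 and 1.
x = [0.0, 0.0, 0.0]
y = [3.0, 4.0, 1.0]

for p in (1, 2, 4, 16, 64):
    print(p, lp_distance(x, y, p))
print("limit", max_coordinate_distance(x, y))
```

For these vectors the distance starts at the sum of the differences for p = 1 and moves toward the single largest difference as p increases.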

Data Projections

Projective methods: preserve a property of the data

Principal Component Analysis (PCA)

Many others: ICA, Factor Analysis

Manifold learning

Multidimensional Scaling (MDS), LLE, Isomap

Principal Component Analysis

Goal: Find a linear projection that captures most of the variance

[Figure: data with the 1st and 2nd principal components]

Principal Component Analysis

PCA pseudo code:

1. Center the data by subtracting the mean

2. Calculate the covariance matrix of the centered data

3. Calculate the eigenvectors (principal components) of the covariance matrix

4. Select the top few (2-3) eigenvectors (those with the highest eigenvalues)

5. Project the data using these eigenvectors as axes
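
The PCA steps above can be sketched with NumPy as follows. This is a minimal illustration, not the presenter's code: the function name `pca`, the synthetic data, and the choice of the sample covariance (dividing by n - 1, which is what `np.cov` does by default) are assumptions.

```python
import numpy as np


def pca(data, n_components=2):
    """Project the rows of `data` onto the top principal components."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)           # subtract the mean
    cov = np.cov(centered, rowvar=False)          # covariance matrix (divides by n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)        # symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest eigenvalues
    components = eigvecs[:, order]                # columns are the principal components
    projected = centered @ components             # coordinates along the new axes
    return projected, components, eigvals[order]


# Synthetic 3-D data (assumed for illustration), stretched along the first axis.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3)) * np.array([5.0, 1.0, 0.2])

projected, components, variances = pca(data, n_components=2)
print(projected.shape)
print(variances)
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue picks the directions that capture the most variance.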

PCA on IRIS Dataset

[Figure: scree plot and biplot]
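
A scree plot shows the eigenvalues of the covariance matrix in decreasing order, often as fractions of the total variance. The sketch below computes those fractions. It uses synthetic data with 150 rows and 4 columns as a stand-in, since loading the actual Iris data would need an extra dependency, and the function name `explained_variance_ratio` is hypothetical.

```python
import numpy as np


def explained_variance_ratio(data):
    """Fraction of total variance along each principal component, largest first."""
    data = np.asarray(data, dtype=float)
    cov = np.cov(data, rowvar=False)          # np.cov centers the data itself
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh is ascending, so reverse
    return eigvals / eigvals.sum()


# Stand-in data (assumed): 150 samples, 4 features with very different spreads.
rng = np.random.default_rng(1)
data = rng.normal(size=(150, 4)) * np.array([4.0, 2.0, 0.5, 0.1])

ratios = explained_variance_ratio(data)
print(ratios)             # heights of the scree plot bars
print(np.cumsum(ratios))  # cumulative variance captured by the first k components
```

A sharp drop after the first one or two values suggests that a 2-D projection keeps most of the variance.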

Multidimensional Scaling

Goal: Find a lower-dimensional embedding of the data that preserves pairwise distances

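Formally: place the points in the low-dimensional space so that the output distances match the input distances as closely as possible

Input distance values: pairwise distances between the original data points

Output distance values: pairwise distances between the embedded points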
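
The exact objective on the slide did not survive in this transcript. One common formulation minimizes the sum over pairs i < j of the squared difference between input and output distances (often called raw stress). The sketch below uses classical (Torgerson) MDS, a closed-form variant that is swapped in here for simplicity and may differ from the method the slides used. The function names and the example points are assumptions.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS from a symmetric distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering   # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(gram)             # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))  # guard against tiny negatives
    return eigvecs[:, order] * scale


# Example input (assumed): corners of a unit square lying flat in 3-D space.
points = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
input_dist = pairwise_distances(points)

embedding = classical_mds(input_dist, n_components=2)
output_dist = pairwise_distances(embedding)

# Raw stress: each pair appears twice in the symmetric matrix, hence the division by 2.
stress = ((input_dist - output_dist) ** 2).sum() / 2
print(embedding.shape, stress)
```

Because the example points already lie in a plane, a 2-D embedding can reproduce every input distance, so the stress is close to zero. Real data usually leaves some residual mismatch.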

MDS Projection of US Capitals

Goodness of MDS Solution: Shepard Diagram

[Figure: Shepard diagram; axes: MDS distances and data distances]
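
A Shepard diagram plots, for every pair of points, the distance in the original data against the distance in the embedding. Points close to a straight line indicate a good fit. The sketch below only computes the paired values and their correlation. Plotting is left out, the helper names are hypothetical, and the "embedding" is a crude stand-in (the first two coordinates of random data) rather than a real MDS solution.

```python
import numpy as np


def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def shepard_pairs(data_dist, embedded_dist):
    """Matched (data distance, embedding distance) values for all pairs i < j."""
    data_dist = np.asarray(data_dist, dtype=float)
    embedded_dist = np.asarray(embedded_dist, dtype=float)
    upper = np.triu_indices_from(data_dist, k=1)  # each unordered pair once
    return data_dist[upper], embedded_dist[upper]


# Assumed example: 30 random points in 5-D, "embedded" by keeping two coordinates.
rng = np.random.default_rng(0)
data = rng.normal(size=(30, 5))
embedded = data[:, :2]

x, y = shepard_pairs(pairwise_distances(data), pairwise_distances(embedded))
correlation = np.corrcoef(x, y)[0, 1]
print(len(x), correlation)
```

Plotting `x` against `y` as a scatter plot gives the diagram itself. A correlation near 1 corresponds to points hugging the diagonal.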

Takeaways

More features are not necessarily better

Understand the assumptions of different modeling choices

When choosing distance functions and projection methods:

Consider the characteristics of the data

Consider the learning objective

Explore multiple choices simultaneously to gain better insight
