Data Projections & Visualization Rajmonda Caceres MIT Lincoln Laboratory.
Data Projections & Visualization
Rajmonda Caceres
MIT Lincoln Laboratory
Dimensionality Reduction
Reduce complexity (visual, computational)
Identify the intrinsic dimensionality of data
Identify the most relevant aspects of data given a task
The Curse of Dimensionality
Lower Dimension
Higher Dimension
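One commonly cited aspect of the curse of dimensionality is that distances concentrate: in high dimensions the nearest and farthest neighbors of a point end up almost equally far away. The sketch below illustrates this under assumptions that are not from the slides (uniform random points in the unit hypercube, Euclidean distance, and a hypothetical helper named `relative_contrast`).

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))   # uniform samples in the unit hypercube
    query = rng.random(dim)                # one random query point
    dists = np.linalg.norm(points - query, axis=1)  # Euclidean distances to the query
    return (dists.max() - dists.min()) / dists.min()

# The contrast between nearest and farthest neighbor shrinks as the dimension grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

A small contrast means that "nearest neighbor" carries little information, which is the question raised by Beyer et al. in the references.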
Data Projections
[Figure: two projections, panels a) and b)]
Not all projections are equal
Data Projections
Desired properties:
Reduced, compressed representation
Preserves useful/intrinsic properties of the data
Amplifies patterns of interest (e.g. outliers)
Simple, interpretable
Trade-off between simplicity and preservation of structure
Distance Function
Helps us organize the data
Helps us discriminate patterns
Distance Functions
Manhattan distance (1-norm, taxicab distance)
Euclidean distance (2-norm)
L-p distance
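For reference, the standard definitions of these three distances can be written as follows. The symbols (vectors x and y with n coordinates) are assumptions chosen here, not notation taken from the slides.

```latex
d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|
\qquad \text{(Manhattan, 1-norm)}

d_2(x, y) = \Bigl( \sum_{i=1}^{n} (x_i - y_i)^2 \Bigr)^{1/2}
\qquad \text{(Euclidean, 2-norm)}

d_p(x, y) = \Bigl( \sum_{i=1}^{n} |x_i - y_i|^p \Bigr)^{1/p},
\quad p \ge 1
\qquad \text{(L-p)}
```

Setting p = 1 or p = 2 in the last formula recovers the first two.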
Distance Functions
As p grows, the largest coordinate difference tends to dominate the overall distance
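A small numeric sketch can make this effect concrete. The function name `lp_distance` and the example vectors are assumptions made for illustration; the p = infinity case is included as the limiting Chebyshev distance.

```python
import numpy as np

def lp_distance(x, y, p):
    """L-p (Minkowski) distance between two vectors; p = inf gives the max norm."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    if np.isinf(p):
        return diff.max()                  # limiting case: largest coordinate difference
    return (diff ** p).sum() ** (1.0 / p)

# Illustrative vectors: coordinate differences are 1, 2 and 4.
x = [0.0, 0.0, 0.0]
y = [1.0, 2.0, 4.0]

# As p increases, the value approaches the largest coordinate difference (4).
for p in (1, 2, 4, 16, np.inf):
    print(p, lp_distance(x, y, p))
```

With p = 1 every coordinate contributes equally, while for large p the result is effectively determined by the single largest difference.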
Distance Functions
Data Projections
Projective methods: preserve a property of the data
Principal Component Analysis (PCA)
Many others: ICA, Factor Analysis
Manifold learning
Multidimensional Scaling (MDS), LLE, Isomap
Principal Component Analysis
Goal: Find a linear projection that captures most of the variance
[Figure: data with the 1st and 2nd principal components]
Principal Component Analysis
PCA pseudo code:
1. Center the data by subtracting the mean
2. Calculate the covariance matrix
3. Calculate the eigenvectors (principal components) of the covariance matrix
4. Select the top few (2-3) eigenvectors (those with the highest eigenvalues)
5. Project the data using these eigenvectors as axes
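The PCA steps above can be sketched in NumPy as follows. This is a minimal illustration, not the presenter's code: the function name `pca`, the synthetic data, and the use of the sample covariance (divisor n - 1, the `np.cov` default) are assumptions.

```python
import numpy as np

def pca(X, n_components=2):
    """Minimal PCA via eigendecomposition of the covariance matrix."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)             # center each feature
    cov = np.cov(X_centered, rowvar=False)      # sample covariance (divisor n - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder so the largest eigenvalue is first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]      # keep the top eigenvectors
    projected = X_centered @ components         # project onto the new axes
    explained = eigvals / eigvals.sum()         # fraction of variance per component
    return projected, components, explained

# Synthetic 3-D data that varies mostly along the direction (1, 2, 0).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t,
                     2 * t + 0.1 * rng.normal(size=200),
                     0.1 * rng.normal(size=200)])

projected, components, explained = pca(X, n_components=2)
print(projected.shape)
print(np.round(explained, 4))
```

The `explained` values are the quantities a scree plot displays, which is one way to decide how many components to keep.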
PCA on the Iris Dataset
[Figures: scree plot and biplot]
Multidimensional Scaling
Goal: Find a lower-dimensional embedding of the data that preserves pairwise distances
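Formally: choose the embedding so that the output distances match the input distances as closely as possible
Input distance values: pairwise distances in the original data
Output distance values: pairwise distances in the embedding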
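One common way to formalize this goal is the raw stress objective used in metric MDS. The symbols below are assumptions introduced for illustration and are not the slide's own notation: \delta_{ij} denotes the input distance between items i and j, y_i the low-dimensional coordinates of item i, and d_{ij} the resulting output distance.

```latex
\min_{y_1, \dots, y_n \in \mathbb{R}^k}
\; \sum_{i < j} \bigl( \delta_{ij} - d_{ij} \bigr)^2,
\qquad
d_{ij} = \lVert y_i - y_j \rVert_2
```

Other MDS variants normalize this sum or replace the input distances with a monotone transformation of them, so this should be read as one representative formulation.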
MDS Projection of US Capitals
Goodness of MDS Solution: Shepard Diagram
[Figure: Shepard diagram; axes: MDS distances vs. data distances]
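The sketch below shows how the two sets of distances compared in a Shepard diagram could be computed. It uses classical (Torgerson) MDS, which is only one MDS variant and is an assumption here since the slides do not say which variant was used; the function names and the synthetic data are also hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS from a distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))  # guard against tiny negatives
    return eigvecs[:, order] * scale             # embedding coordinates

# Synthetic 5-D data whose variance is concentrated in the first two coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5)) * np.array([3.0, 2.0, 0.3, 0.2, 0.1])

D_data = pairwise_distances(X)                   # data distances
Y = classical_mds(D_data, n_components=2)
D_mds = pairwise_distances(Y)                    # MDS distances

# A Shepard diagram plots these two vectors against each other, one point per pair.
iu = np.triu_indices(D_data.shape[0], k=1)
shepard_corr = np.corrcoef(D_data[iu], D_mds[iu])[0, 1]
print(round(shepard_corr, 3))
```

Points lying close to the diagonal of the Shepard diagram (a correlation near 1) indicate that the embedding preserves the original distances well.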
Takeaways
More features are not necessarily better
Understand the assumptions of different modeling choices
When choosing distance functions and projection methods:
Consider the characteristics of the data
Consider the learning objective
Explore multiple choices simultaneously to gain better insight
References
http://statweb.stanford.edu/~jtaylo/courses/stats202/mds.html
https://planspacedotorg.wordpress.com/2013/02/03/pca-3d-visualization-and-clustering-in-r/
Multidimensional Scaling, Leland Wilkinson
Dimension Reduction: A Guided Tour, Christopher J.C. Burges
When Is "Nearest Neighbor" Meaningful?, Beyer, K.S., Goldstein, J., Ramakrishnan, R. & Shaft, U.