Making BIG DATA smaller
Making BIG DATA smaller
Tony Tran
About Me
● I am from SF
○ SF Bay Area Machine Learning Meetup
○ www.sfdatajobs.com
● Background:
○ BS/MS CompSci (focus on ML/Vision)
○ 4 years as Data Engineer in Ad Tech
○ Currently Consulting
What does “data” mean?
RAW DATA (structured & unstructured) → Clean & Transform → Extract Features → DATA (observations and features)
DATA can be viewed as an (m x D) matrix: m observations (rows), D features (columns).
What is “Big Data?”
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications. -- Wikipedia
To me, Big Data is when:
● you run out of disk space
● you run into “Out Of Memory” errors
● your S3 bill triggers your credit card company to call you
● you are willing to go through the pain of setting up a Hadoop/Spark/etc. cluster (have you tried configuring your own cluster?)
Question
● Can we make big data smaller?
● What are the benefits of having “small data”?
○ Reduced storage costs
○ Reduced computational costs
○ No more “Out Of Memory” errors
Ideas for making data small
● Reduce the number of observations
○ Only keep observations that are “important”
○ Remove redundant observations
○ Randomly sample
● Reduce the number of features
○ Remove non-useful features
○ Combine features
○ Something clever
Random sampling of observations
Our Focus: Reducing Features
Dimensionality Reduction maps the (m x D) data matrix to an (m x d) matrix, where d << D.
Note: we want to preserve the distances between observations as well as possible.
Exercise
Given the following set of 2D observations, how can we represent each observation in 1D while still preserving the distances between points as well as possible?
Exercise: projecting observations onto the x-axis
Exercise: projecting observations onto the y-axis
Will it still work if ...
Are we getting good results because the x-axis is aligned with the spread of the observations?
Unaligned data
Can we find a better coordinate system that is aligned with the data?
Aligned coordinate systemDirection which aligns with the spread of observations
● dimensionality reduction easy with aligned coordinate system
Computing Spread
● Spread = variance of the projected observations
● How do we determine the observations' coordinates in the new coordinate system?
Linear Algebra
Given directions v1, v2 and a point p:
a1 = (v1 · p) / ||v1||
a2 = (v2 · p) / ||v2||
p = (p1, p2) in the original coordinate system
p = (a1, a2) in the new coordinate system
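The projection formulas above can be sketched in a few lines of NumPy. The direction vectors and the point here are hypothetical example values, not from the slide's figure:

```python
import numpy as np

# Hypothetical 2D example: two direction vectors v1, v2 and a point p.
v1 = np.array([1.0, 1.0])
v2 = np.array([-1.0, 1.0])
p = np.array([2.0, 0.0])

# Coordinate of p along each direction: a_i = (v_i . p) / ||v_i||
a1 = v1.dot(p) / np.linalg.norm(v1)
a2 = v2.dot(p) / np.linalg.norm(v2)

# p = (a1, a2) in the new coordinate system; since v1 and v2 are
# orthogonal, the distance from the origin is preserved.
print(a1, a2)
```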
Observations
● Finding an aligned coordinate system makes it easy for us to do dimensionality reduction.
○ Represent observations in the new coordinate system, then remove features (axes).
● The direction parallel to the spread of the data maximizes interpoint distances.
New tool (PCA)
● How do we find an aligned coordinate system? Is there a tool already developed for this?
● Principal Component Analysis
○ Given a set of observations, finds an aligned coordinate system.
○ The first direction of the coordinate system will contain the most spread, followed by the second, and so forth.
○ O(m^3) runtime
PCA (scikit-learn)
>>> from sklearn.decomposition import PCA
>>> X = ...  # data matrix of size (m x D)
>>>
>>> pca = PCA(n_components=d)
>>> pca.fit(X)                  # fits new coordinate system to data
>>> X_small = pca.transform(X)  # transforms data to the new coordinate
...                             # system and removes dimensions
>>> X_small.shape               # gives us a matrix of size (m x d)
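The snippet above can be made runnable end to end. This is a minimal sketch with synthetic data; the sizes (m=200, D=50, d=5) are illustrative values, not from the slides:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic (m x D) data matrix: 200 observations, 50 features.
rng = np.random.RandomState(0)
X = rng.randn(200, 50)

pca = PCA(n_components=5)
pca.fit(X)                   # fits the aligned coordinate system to the data
X_small = pca.transform(X)   # projects into it and drops the extra axes

print(X_small.shape)         # (200, 5)
```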
3D to 2D
Find aligned directions v1, v2, v3; keep only v1 and v2 and project the observations onto the resulting 2D data space.
Image Data
● What if our data is images? How do we represent an image as an observation?
● Flatten each (100 x 100) pixel matrix into a single 10k-dimensional vector.
Image Data
The (100 x 100) matrix is unrolled row by row (r1, r2, r3, ..., r9999, r10000) into a 10k-dimensional vector, which can then be expressed in a new coordinate system of directions v1, v2, ..., v10k.
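The row-by-row unrolling described above is a single reshape in NumPy. The image here is a toy stand-in, not an actual photo from the slides:

```python
import numpy as np

# A toy stand-in for a (100 x 100) grayscale image.
img = np.arange(100 * 100, dtype=float).reshape(100, 100)

# Flatten row by row (r1, r2, ..., r100 concatenated) into one
# 10k-dimensional observation vector.
vec = img.reshape(-1)   # shape (10000,)

# Stacking many such vectors gives the (m x D) data matrix PCA expects.
print(vec.shape)
```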
Image Data
Original image = a1*v1 + a2*v2 + … + a10k*v10k (exact, using all 10k directions)
Reconstruct with 20 directions: ≈ a1*v1 + … + a20*v20
Reconstruct with 90 directions: ≈ a1*v1 + … + a90*v90
Compression!
Each image can now be represented by 90 weights!
Image Data
● original image representation: 10k values
● compression requires:
○ 90 direction vectors = 90 x 10k values (shared across all images)
○ 1 image = 90 weights (for the direction vectors)
Image Data
● For 200 images:
○ original representation: 200 * 10k = 2,000,000 values
○ compression: (90 x 10k) + 200 * 90 = 918,000 values
● It makes sense to use this compression technique when we have more than 90 images to compress.
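The storage accounting above, including the break-even point, can be checked with a few lines of arithmetic:

```python
# Storage accounting from the slide (all counts in stored values).
D, d, m = 10_000, 90, 200

original = m * D             # 200 images x 10k pixels each
compressed = d * D + m * d   # shared direction vectors + per-image weights

print(original, compressed)  # 2000000 918000

# Break-even: d*D + m*d < m*D  <=>  m > d*D / (D - d), about 90.8 images,
# which matches the slide's "more than 90 images" rule of thumb.
break_even = d * D / (D - d)
print(break_even)
```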
Keep in mind
● O(m^3) runtime
● Need to keep around d directions of length D to perform the projection.
● Requires reading all of the data into memory.
● What if the data is non-linear?
Random Projections
● Generate a (d x D) matrix, P, whose elements are drawn from a normal distribution ~ N(0, 1/d)
● To compute a projected observation: o_new = P.dot(o)
(the (d x D) matrix P times a D-dimensional observation gives a d-dimensional observation)
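The two steps above fit in a few lines of NumPy. The sizes D=1000 and d=50 are illustrative choices; note that variance 1/d means a standard deviation of 1/sqrt(d):

```python
import numpy as np

rng = np.random.RandomState(0)
D, d = 1000, 50

# (d x D) projection matrix with entries ~ N(0, 1/d)
# (scale is the standard deviation, so sqrt(1/d)).
P = rng.normal(loc=0.0, scale=1.0 / np.sqrt(d), size=(d, D))

o = rng.randn(D)    # one D-dimensional observation
o_new = P.dot(o)    # its d-dimensional projection

print(o_new.shape)  # (50,)
```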
Intuition
● Randomly determine a coordinate system.
● Keep d directions.
Safe value for d?
Using this technique (random projection from D down to d dimensions, d << D), what is a “safe” value for d?
Safe value for d?
>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
def johnson_lindenstrauss_min_dim(n_samples, eps=0.1):
● input:
○ n_samples -- the number of observations you have
○ eps -- the amount of error you’re willing to tolerate
● output:
○ a safe number of features that you can project down to
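A quick usage sketch (the observation count and eps values here are illustrative): the looser the distortion you tolerate, the lower you can safely project.

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Safe target dimension for 10,000 observations at 10% distance distortion.
d_tight = johnson_lindenstrauss_min_dim(n_samples=10_000, eps=0.1)

# Tolerating 50% distortion lets us project much lower.
d_loose = johnson_lindenstrauss_min_dim(n_samples=10_000, eps=0.5)

print(d_tight, d_loose)
```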
Mathematical Guarantees
d >= 4 log(m) / (eps^2 / 2 - eps^3 / 3)
With such a d, for every pair of observations u and v, with high probability:
(1 - eps) * ||u - v||^2 <= ||Pu - Pv||^2 <= (1 + eps) * ||u - v||^2
i.e., each projected distance stays within a factor of (1 ± eps) of the original distance.
Practical usage
● High probability that the projection will be good, but there’s still a chance that it will not be!
○ Create multiple projections and test the guarantees with sampled observations.
Comparison
● PCA
○ finds an aligned coordinate system which maximizes spread
○ O(m^3) runtime + requires all points to be read into memory
○ O(dD) space to store the aligned coordinate system for projection
● Random Projection
○ finds a random coordinate system
○ O(dD) runtime and space to construct the projection matrix
○ guaranteed with high probability to work
References
● http://www.cs.princeton.edu/~cdecoro/eigenfaces/
● http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
● http://scikit-learn.org/stable/modules/random_projection.html
● http://blog.yhathq.com/posts/sparse-random-projections.html