Making BIG DATA smaller
Making BIG DATA smaller
Tony Tran
About Me
● I am from SF
○ SF Bay Area Machine Learning Meetup
○ www.sfdatajobs.com
● Background:
○ BS/MS CompSci (focus on ML/Vision)
○ 4 years as Data Engineer in Ad Tech
○ Currently Consulting
What does “data” mean?
RAW DATA (structured & unstructured) → Clean & Transform → Extract Features → DATA (observations and features)
DATA can be viewed as an (m x D) matrix: m observations (rows), D features (columns).
What is “Big Data?”
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications. -- Wikipedia
To me, Big Data is when:
● you run out of disk space
● you run into “Out Of Memory” errors
● your S3 bill triggers your credit card company to call you
● you are willing to go through the pain of setting up a Hadoop/Spark/etc. cluster (have you tried configuring your own cluster?)
Question
● Can we make big data smaller?
● What are the benefits of having “small data”?
○ Reduced storage costs
○ Reduced computational costs
○ No more “Out Of Memory” errors
Ideas for making data small
● Reduce the number of observations
○ Only keep observations that are “important”
○ Remove redundant observations
○ Randomly sample
● Reduce the number of features
○ Remove non-useful features
○ Combine features
○ Something clever
Random sampling of observations
Our Focus: Reducing Features
Dimensionality Reduction maps the (m x D) data matrix to an (m x d) matrix, where d << D.
Note: we want to preserve the distances between observations as well as possible.
Exercise
Given the following set of 2D observations, how can we represent each observation in 1D while still preserving the distances between points as well as possible?
Exercise: projecting observations onto the x-axis
Exercise: projecting observations onto the y-axis
Will it still work if ...
Are we getting good results because the x-axis is aligned with the spread of the observations?
Unaligned data
Can we find a better coordinate system that is aligned with the data?
Aligned coordinate systemDirection which aligns with the spread of observations
● dimensionality reduction easy with aligned coordinate system
Computing Spread
● Spread = variance of the projected observations
● How do we determine the observations' coordinates in the new coordinate system?
Linear Algebra
Given directions v1, v2 and a point p:
a1 = (v1 · p) / ||v1||
a2 = (v2 · p) / ||v2||
p = (p1, p2) in the original coordinate system
p = (a1, a2) in the new coordinate system
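The projection formulas above can be sketched in a few lines of NumPy. The direction vectors and the point here are hypothetical example values, not from the slide's figure:

```python
import numpy as np

# Hypothetical 2D example: two direction vectors v1, v2 and a point p.
v1 = np.array([1.0, 1.0])
v2 = np.array([-1.0, 1.0])
p = np.array([2.0, 0.0])

# Coordinate of p along each direction: a_i = (v_i . p) / ||v_i||
a1 = v1.dot(p) / np.linalg.norm(v1)
a2 = v2.dot(p) / np.linalg.norm(v2)

# p = (a1, a2) in the new coordinate system; since v1 and v2 are
# orthogonal, the distance from the origin is preserved.
print(a1, a2)
```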
Observations
● Finding an aligned coordinate system makes it easy for us to do dimensionality reduction.
○ Represent observations in the new coordinate system, then remove features (axes).
● The direction parallel to the spread of the data maximizes interpoint distances.
New tool (PCA)
● How do we find an aligned coordinate system? Is there a tool already developed for this?
● Principal Component Analysis
○ Given a set of observations, finds an aligned coordinate system.
○ The first direction of the coordinate system will contain the most spread, followed by the second, and so forth.
○ O(m^3) runtime
PCA (scikit-learn)
>>> from sklearn.decomposition import PCA
>>> X = ...  # data matrix of size (m x D)
>>>
>>> pca = PCA(n_components=d)
>>> pca.fit(X)                  # fits new coordinate system to data
>>> X_small = pca.transform(X)  # transforms data to the new coordinate
...                             # system and removes dimensions
>>> X_small.shape               # gives us a matrix of size (m x d)
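The snippet above can be made runnable end to end. This is a minimal sketch with synthetic data; the sizes (m=200, D=50, d=5) are illustrative values, not from the slides:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic (m x D) data matrix: 200 observations, 50 features.
rng = np.random.RandomState(0)
X = rng.randn(200, 50)

pca = PCA(n_components=5)
pca.fit(X)                   # fits the aligned coordinate system to the data
X_small = pca.transform(X)   # projects into it and drops the extra axes

print(X_small.shape)         # (200, 5)
```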
3D to 2D
Find aligned directions v1, v2, v3; keep only v1 and v2 and project the observations onto the resulting 2D data space.
Image Data
● What if our data is images? How do we represent an image as an observation?
● Flatten each (100 x 100) pixel matrix into a single 10k-dimensional vector.
Image Data
The (100 x 100) matrix is unrolled row by row (r1, r2, r3, ..., r9999, r10000) into a 10k-dimensional vector, which can then be expressed in a new coordinate system of directions v1, v2, ..., v10k.
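The row-by-row unrolling described above is a single reshape in NumPy. The image here is a toy stand-in, not an actual photo from the slides:

```python
import numpy as np

# A toy stand-in for a (100 x 100) grayscale image.
img = np.arange(100 * 100, dtype=float).reshape(100, 100)

# Flatten row by row (r1, r2, ..., r100 concatenated) into one
# 10k-dimensional observation vector.
vec = img.reshape(-1)   # shape (10000,)

# Stacking many such vectors gives the (m x D) data matrix PCA expects.
print(vec.shape)
```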
Image Data
Original image = a1*v1 + a2*v2 + … + a10k*v10k (exact, using all 10k directions)
Reconstruct with 20 directions: ≈ a1*v1 + … + a20*v20
Reconstruct with 90 directions: ≈ a1*v1 + … + a90*v90
Compression!
Each image can now be represented by 90 weights!
Image Data
● original image representation: 10k values
● compression requires:
○ 90 direction vectors = 90 x 10k values (shared across all images)
○ 1 image = 90 weights (for the direction vectors)
Image Data
● For 200 images:
○ original representation: 200 * 10k = 2,000,000 values
○ compression: (90 x 10k) + 200 * 90 = 918,000 values
● It makes sense to use this compression technique when we have more than 90 images to compress.
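The storage accounting above, including the break-even point, can be checked with a few lines of arithmetic:

```python
# Storage accounting from the slide (all counts in stored values).
D, d, m = 10_000, 90, 200

original = m * D             # 200 images x 10k pixels each
compressed = d * D + m * d   # shared direction vectors + per-image weights

print(original, compressed)  # 2000000 918000

# Break-even: d*D + m*d < m*D  <=>  m > d*D / (D - d), about 90.8 images,
# which matches the slide's "more than 90 images" rule of thumb.
break_even = d * D / (D - d)
print(break_even)
```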
Keep in mind
● O(m^3) runtime
● Need to keep around d directions of length D to perform the projection.
● Requires reading all of the data into memory.
● What if the data is non-linear?
Random Projections
● Generate a (d x D) matrix, P, whose elements are drawn from a normal distribution ~ N(0, 1/d)
● To compute a projected observation: o_new = P.dot(o)
(the (d x D) matrix P times a D-dimensional observation gives a d-dimensional observation)
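The two steps above fit in a few lines of NumPy. The sizes D=1000 and d=50 are illustrative choices; note that variance 1/d means a standard deviation of 1/sqrt(d):

```python
import numpy as np

rng = np.random.RandomState(0)
D, d = 1000, 50

# (d x D) projection matrix with entries ~ N(0, 1/d)
# (scale is the standard deviation, so sqrt(1/d)).
P = rng.normal(loc=0.0, scale=1.0 / np.sqrt(d), size=(d, D))

o = rng.randn(D)    # one D-dimensional observation
o_new = P.dot(o)    # its d-dimensional projection

print(o_new.shape)  # (50,)
```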
Intuition
● Randomly determine a coordinate system.
● Keep d directions.
Safe value for d?
Using this technique (random projection from D down to d dimensions, d << D), what is a “safe” value for d?
Safe value for d?
>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
def johnson_lindenstrauss_min_dim(n_samples, eps=0.1):
● input:
○ n_samples -- the number of observations you have
○ eps -- the amount of error you’re willing to tolerate
● output:
○ a safe number of features that you can project down to
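A quick usage sketch (the observation count and eps values here are illustrative): the looser the distortion you tolerate, the lower you can safely project.

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Safe target dimension for 10,000 observations at 10% distance distortion.
d_tight = johnson_lindenstrauss_min_dim(n_samples=10_000, eps=0.1)

# Tolerating 50% distortion lets us project much lower.
d_loose = johnson_lindenstrauss_min_dim(n_samples=10_000, eps=0.5)

print(d_tight, d_loose)
```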
Mathematical Guarantees
d >= 4 log(m) / (eps^2 / 2 - eps^3 / 3)
With such a d, for every pair of observations u and v, with high probability:
(1 - eps) * ||u - v||^2 <= ||Pu - Pv||^2 <= (1 + eps) * ||u - v||^2
i.e., each projected distance stays within a factor of (1 ± eps) of the original distance.
Practical usage
● High probability that the projection will be good, but there’s still a chance that it will not be!
○ Create multiple projections and test the guarantees with sampled observations.
Comparison
● PCA
○ finds an aligned coordinate system which maximizes spread
○ O(m^3) runtime + requires all points to be read into memory
○ O(dD) space to store the aligned coordinate system for projection
● Random Projection
○ finds a random coordinate system
○ O(dD) runtime and space to construct the projection matrix
○ guaranteed with high probability to work
References
● http://www.cs.princeton.edu/~cdecoro/eigenfaces/
● http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA
● http://scikit-learn.org/stable/modules/random_projection.html
● http://blog.yhathq.com/posts/sparse-random-projections.html