Making BIG DATA smaller


Transcript of Making BIG DATA smaller

Page 1: Making BIG DATA smaller

Making BIG DATA smaller
Tony Tran

Page 2: Making BIG DATA smaller

About Me
● I am from SF
  ○ SF Bay Area Machine Learning Meetup
  ○ www.sfdatajobs.com
● Background:
  ○ BS/MS CompSci (focus on ML/Vision)
  ○ 4 years as a Data Engineer in Ad Tech
  ○ Currently consulting

Page 3: Making BIG DATA smaller

What does “data” mean?

[Figure: RAW DATA (structured & unstructured) -> Clean & Transform -> Extract Features -> DATA (observations and features), stored as a matrix of m observations x D features]

Page 4: Making BIG DATA smaller

What is “Big Data?”
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications -- Wikipedia

Page 5: Making BIG DATA smaller

What is “Big Data?”
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process them using traditional data processing applications -- Wikipedia

To me, Big Data is when:
● you run out of disk space
● you run into “Out Of Memory” errors
● your S3 bill triggers your credit card company to call you
● you are willing to go through the pains of setting up a Hadoop/Spark/etc. cluster (have you tried configuring your own cluster?)

Page 7: Making BIG DATA smaller

Question
● Can we make big data smaller?
● Benefits of having “small data”:
  ○ Reduce storage costs
  ○ Reduce computational costs
  ○ No more “Out Of Memory” errors

Page 8: Making BIG DATA smaller

Ideas for making data small
● Reduce the number of observations
  ○ Only keep observations that are “important”
  ○ Remove redundant observations
  ○ Randomly sample
● Reduce the number of features
  ○ Remove non-useful features
  ○ Combine features
  ○ Something clever

Page 9: Making BIG DATA smaller

Random sampling of observations
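A minimal sketch of random sampling with numpy (the data matrix X, its shape, and the 10% sample size are made-up stand-ins for illustration):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 50))        # stand-in data: m = 100k observations, D = 50 features
keep = rng.choice(X.shape[0], size=X.shape[0] // 10, replace=False)  # 10% of the row indices, no repeats
X_sample = X[keep]                        # smaller data set: (10_000 x 50)
print(X_sample.shape)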

Page 10: Making BIG DATA smaller

Ideas for making data small
● Reduce the number of observations
  ○ Only keep rows that are “important”
  ○ Remove redundant rows
  ○ Randomly sample
● Reduce the number of features
  ○ Remove non-useful features
  ○ Combine features
  ○ Something clever

Page 11: Making BIG DATA smaller

Our Focus: reducing features

[Figure: dimensionality reduction maps the (m x D) data matrix to a smaller (m x d) matrix, where d << D]

Note: we want to preserve the distances between observations as best as possible.

Page 12: Making BIG DATA smaller

Exercise
Given the following set of 2d observations, how can we represent each observation in 1d while still preserving the distances between points as best as possible?

Page 13: Making BIG DATA smaller

Exercise

Page 14: Making BIG DATA smaller

Exercise

Page 15: Making BIG DATA smaller

Exercise
Projecting the observations onto the x-axis

Page 16: Making BIG DATA smaller

Exercise
Projecting the observations onto the y-axis

Page 17: Making BIG DATA smaller

Will it still work if ...

Are we getting good results because the x-axis is aligned with the spread of the observations?

Page 18: Making BIG DATA smaller

Unaligned data

Can we find a better coordinate system that is aligned with the data?

Page 19: Making BIG DATA smaller

Aligned coordinate system
Direction which aligns with the spread of the observations
● dimensionality reduction is easy with an aligned coordinate system

Page 20: Making BIG DATA smaller

Computing Spread

● Spread = variance of the projected observations
● How do we determine an observation’s coordinates in the new coordinate system?

Page 21: Making BIG DATA smaller

Linear Algebra
Projecting a point p onto the directions v1 and v2:

a1 = (v1 / ||v1||) · p
a2 = (v2 / ||v2||) · p

p = (p1, p2) originally
p = (a1, a2) in the new coordinate system
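A small numpy sketch of the projection above (the values of v1, v2, and p are made up for illustration; the last two lines also compute the “spread” from the previous slide as the variance of many projected points):

import numpy as np

v1 = np.array([2.0, 1.0])            # first direction (not necessarily unit length)
v2 = np.array([-1.0, 2.0])           # second direction
p  = np.array([3.0, 4.0])            # a point in the original coordinates

a1 = (v1 / np.linalg.norm(v1)) @ p   # coordinate of p along v1
a2 = (v2 / np.linalg.norm(v2)) @ p   # coordinate of p along v2
print(a1, a2)                        # p = (a1, a2) in the new coordinate system

points = np.random.default_rng(0).normal(size=(1000, 2))
spread = (points @ (v1 / np.linalg.norm(v1))).var()   # variance of observations projected onto v1
print(spread)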

Page 22: Making BIG DATA smaller

Observations
● Finding an aligned coordinate system makes it easy for us to do dimensionality reduction.
  ○ represent observations in the new coordinate system, then remove features (axes).
● The direction parallel to the spread of the data maximizes interpoint distances.

Page 23: Making BIG DATA smaller

New tool (PCA)
● How do we find an aligned coordinate system?
● Is there a tool already developed for this?

Page 24: Making BIG DATA smaller

New tool (PCA)
● How do we find an aligned coordinate system?
● Principal Component Analysis
  ○ Given a set of observations, finds an aligned coordinate system.
  ○ The first direction of the coordinate system will contain the most spread, followed by the second, and so forth.
  ○ O(m^3) runtime

Page 25: Making BIG DATA smaller

PCA (scikit-learn)

>>> from sklearn.decomposition import PCA
>>> X = …                        # data matrix of size (m x D)
>>>
>>> pca = PCA(n_components=d)
>>> pca.fit(X)                   # fits new coordinate system to the data
>>> X_small = pca.transform(X)   # transforms the data to the new coordinate
...                              # system and removes dimensions
>>> X_small.shape                # gives us a matrix of size (m x d)
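Not on the slide, but a common way to pick d is to look at how much of the spread each new direction captures; scikit-learn exposes this as explained_variance_ratio_ (the data here is a random stand-in):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 50))   # stand-in (m x D) data
pca = PCA(n_components=10).fit(X)
print(pca.explained_variance_ratio_)           # fraction of the spread captured by each direction
print(pca.explained_variance_ratio_.cumsum())  # cumulative share kept by the first d directions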

Page 26: Making BIG DATA smaller

Our Focus: reducing features

[Figure: dimensionality reduction maps the (m x D) data matrix to a smaller (m x d) matrix, where d << D]

Note: we want to preserve the distances between observations as best as possible.

Page 27: Making BIG DATA smaller

3D to 2D

[Figure: 3D observations with aligned directions v1, v2, v3; keep only v1 and v2 to get the projected data space]

Page 28: Making BIG DATA smaller

Image Data
● What if our data is images? How do we represent an image as an observation?

[Figure: a (100x100) image matrix]

Page 29: Making BIG DATA smaller

Images and vectors

[Figure: a (100x100) image matrix flattened into a 10k-dimensional vector]
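A minimal sketch of flattening an image into an observation (the image here is just random pixel values standing in for a real one):

import numpy as np

img = np.random.default_rng(0).integers(0, 256, size=(100, 100))  # stand-in grayscale image
vec = img.reshape(-1)              # concatenate the rows into one 10,000-dimensional vector
print(vec.shape)                   # (10000,)
img_back = vec.reshape(100, 100)   # the flattening is lossless: reshape recovers the image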

Page 30: Making BIG DATA smaller

Image Data

[Figure: the (100x100) image matrix is flattened into a 10k-dimensional vector by concatenating its rows r1, r2, ..., r10000]

Page 31: Making BIG DATA smaller

Image Data

[Figure: the same flattening again; every pixel of the (100x100) image becomes one entry of the 10k-dimensional vector]

Page 32: Making BIG DATA smaller

Image Data

[Figure: the flattened image viewed as a point in 10k-dimensional space, with one axis (v1, v2, ..., v10k) per pixel]

Page 33: Making BIG DATA smaller

Image Data

[Figure: the flattened image (r1, ..., r10000) plotted against the axes v1, v2, ..., v10k]

Page 34: Making BIG DATA smaller

Image Data

Page 35: Making BIG DATA smaller

Image Data

Original Image = a1·v1 + a2·v2 + … + a10k·v10k

Page 36: Making BIG DATA smaller

Image Data

Original Image = a1·v1 + a2·v2 + … + a10k·v10k

Reconstruct with 20 directions:
≈ a1·v1 + … + a20·v20

Page 37: Making BIG DATA smaller

Image Data

Original Image = a1·v1 + a2·v2 + … + a10k·v10k

Reconstruct with 20 directions:
≈ a1·v1 + … + a20·v20

Reconstruct with 90 directions:
≈ a1·v1 + … + a90·v90

Page 38: Making BIG DATA smaller

Image Data

Original Image = a1·v1 + a2·v2 + … + a10k·v10k

Reconstruct with 90 directions:
≈ a1·v1 + … + a90·v90

Compression! Each image can now be represented by 90 weights!
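A sketch of the compression idea with scikit-learn's PCA (random data stands in for the 200 real images; with real images the 90 directions would look like the reconstructions above):

import numpy as np
from sklearn.decomposition import PCA

images = np.random.default_rng(0).normal(size=(200, 10_000))  # 200 flattened (100x100) images
pca = PCA(n_components=90).fit(images)          # learn the 90 direction vectors

weights = pca.transform(images)                 # (200 x 90): 90 weights per image
reconstructed = pca.inverse_transform(weights)  # (200 x 10k): rebuilt from the 90 directions
print(weights.shape, reconstructed.shape)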

Page 39: Making BIG DATA smaller

Image Data

● original image representation: 10k values
● compression requires:
  ○ 90 direction vectors = (90 x 10k) values
  ○ 1 image = 90 weights (for the direction vectors)

Reconstruct with 90 directions: ≈ a1·v1 + … + a90·v90

Page 40: Making BIG DATA smaller

Image Data

● For 200 images:
  ○ original representation: 200 * 10k values
  ○ compression: (90 x 10k) + 200 * 90 values

Reconstruct with 90 directions: ≈ a1·v1 + … + a90·v90

Page 41: Making BIG DATA smaller

Image Data

● For 200 images:
  ○ original representation: 200 * 10k values
  ○ compression: (90 x 10k) + 200 * 90 values

Reconstruct with 90 directions: ≈ a1·v1 + … + a90·v90

It makes sense to use this compression technique when we have more than 90 images to compress.
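Plugging in the numbers makes the break-even point concrete:

original:    200 * 10,000               = 2,000,000 values
compressed: (90 * 10,000) + (200 * 90)  =   918,000 values

In general, (90 * 10,000) + m * 90 < m * 10,000 once m > 900,000 / 9,910 ≈ 91, which is where the "more than 90 images" rule of thumb above comes from.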

Page 42: Making BIG DATA smaller

Keep in mind
● O(m^3) runtime
● Need to keep around d directions of length D to perform the projection.
● Requires all of the data to be read into memory.
● What if the data is non-linear?

Page 43: Making BIG DATA smaller

Random Projections
● Generate a (d x D) matrix, P, whose elements are drawn from a normal distribution ~ N(0, 1/d)
● To compute a projected observation:
  ○ o_new = P.dot(o) (see the sketch below)

[Figure: the (d x D) matrix P times a D-dimensional observation o gives the d-dimensional observation o_new]
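A minimal sketch of this projection with plain numpy (the sizes m, D, d are made up); scikit-learn's GaussianRandomProjection does the same thing:

import numpy as np

m, D, d = 1_000, 10_000, 300
rng = np.random.default_rng(0)

X = rng.normal(size=(m, D))                           # stand-in (m x D) data
P = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, D))    # entries ~ N(0, 1/d): variance 1/d, std sqrt(1/d)
X_new = X @ P.T                                       # projected data: (m x d)
print(X_new.shape)

# equivalent with scikit-learn:
# from sklearn.random_projection import GaussianRandomProjection
# X_new = GaussianRandomProjection(n_components=d).fit_transform(X)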

Page 44: Making BIG DATA smaller

Intuition
● Randomly determine coordinate system.

Page 45: Making BIG DATA smaller

Intuition
● Randomly determine coordinate system.

Page 46: Making BIG DATA smaller

Intuition
● Randomly determine coordinate system.

Page 47: Making BIG DATA smaller

Intuition
● Randomly determine coordinate system.
● Keep d directions.

Page 48: Making BIG DATA smaller

Safe value for d?
Using this technique, what is a “safe” value for d?

[Figure: random projection maps the (m x D) data matrix to an (m x d) matrix, where d << D]

Page 49: Making BIG DATA smaller

Safe value for d?

>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim

def johnson_lindenstrauss_min_dim(n_samples, eps=0.1):

● input:
  ○ n_samples -- the number of observations you have
  ○ eps -- the amount of error you’re willing to tolerate
● output:
  ○ a safe number of features that you can project down to
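For example (the output values are approximate; they come from the formula on the next slide):

>>> johnson_lindenstrauss_min_dim(n_samples=1_000_000, eps=0.1)   # roughly 11,800 features
>>> johnson_lindenstrauss_min_dim(n_samples=1_000_000, eps=0.5)   # roughly 660 features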

Page 50: Making BIG DATA smaller

Mathematical Guarantees

d >= 4 log(m) / (eps^2 / 2 - eps^3 / 3)

With d chosen this way, the projection P preserves pairwise distances up to a factor of (1 ± eps) with high probability: for any two observations u and v,

(1 - eps) * ||u - v||^2  <=  ||Pu - Pv||^2  <=  (1 + eps) * ||u - v||^2

where ||u - v|| is the original distance and ||Pu - Pv|| is the projected distance.

Page 51: Making BIG DATA smaller

Practical usage
● High probability that the projection will be good, but there’s still a chance that it will not be!
  ○ Create multiple projections and test the guarantees with sampled observations (see the sketch below).
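A sketch of that check (the sizes, the number of candidate projections, and the acceptance rule are made-up choices): generate a few projection matrices, project a small sample of observations, and keep the matrix that distorts pairwise distances the least.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(2_000, 5_000))                   # stand-in (m x D) data
sample = X[rng.choice(len(X), 200, replace=False)]    # small sample used for the check
d = 500

orig = pdist(sample)                                  # pairwise distances before projection
best_P, best_err = None, np.inf
for _ in range(5):                                    # try a handful of random projections
    P = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, X.shape[1]))
    proj = pdist(sample @ P.T)                        # pairwise distances after projection
    err = np.max(np.abs(proj - orig) / orig)          # worst relative distortion on the sample
    if err < best_err:
        best_P, best_err = P, err

X_new = X @ best_P.T                                  # project the full data with the best matrix
print(best_err, X_new.shape)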

Page 52: Making BIG DATA smaller

Comparison
● PCA
  ○ finds an aligned coordinate system which maximizes spread.
  ○ O(m^3) runtime + requires all points to be read into memory.
  ○ O(dD) space to store the aligned coordinate system for projection.
● Random Projection
  ○ finds a random coordinate system.
  ○ O(dD) runtime and space to construct the projection matrix.
  ○ Guaranteed with high probability to work.
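A rough way to see the trade-off in practice (the sizes are made up and timings will vary by machine): fit both on the same stand-in data and compare.

import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

X = np.random.default_rng(0).normal(size=(2_000, 5_000))   # stand-in (m x D) data
d = 100

t0 = time.time()
X_pca = PCA(n_components=d).fit_transform(X)                # needs all of the data to find directions
print("PCA:              ", X_pca.shape, round(time.time() - t0, 2), "s")

t0 = time.time()
X_rp = GaussianRandomProjection(n_components=d).fit_transform(X)  # only needs shapes and randomness
print("Random projection:", X_rp.shape, round(time.time() - t0, 2), "s")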