Tackling&an&Unknown&Number&of& Features&with&Sketching&& · 4/7/15 2...

4/7/15

1

©Emily Fox 2015 1

Case Study 1: Es.ma.ng Click Probabili.es

Tackling an Unknown Number of Features with Sketching

Machine Learning for Big Data CSE547/STAT548, University of Washington

Emily Fox April 9th, 2015

Problem 1: Complexity of LR Updates

•  LogisTc regression update:

•  Complexity of updates: –  Constant in number of data points –  In number of features?

•  Problem both in terms of computaTonal complexity and sample complexity

•  What can we with very high dimensional feature spaces? –  Kernels not always appropriate, or scalable –  What else?

©Emily Fox 2015 2

w

(t+1)i w

(t)i + ⌘t

n

��w(t)i + x

(t)i [y(t) � P (Y = 1|x(t)

,w

(t))]o

4/7/15

2

Problem 2: Unknown Number of Features

•  For example, bag-‐of-‐words features for text data: –  “Mary had a liZle lamb, liZle lamb…”

•  What’s the dimensionality of x? •  What if we see new word that was not in our vocabulary?

–  Obamacare

–  TheoreTcally, just keep going in your learning, and iniTalize wObamacare = 0 –  In pracTce, need to re-‐allocate memory, fix indices,… A big problem for Big Data

©Emily Fox 2015 3

•  Keep p by m Count matrix

•  p hash funcTons: –  Just like in Bloom Filter, decrease errors with mulTple hashes –  Every Tme see string i:

©Emily Fox 2015 4

8j 2 {1, . . . , p} : Count[j, hj(i)] Count[j, hj(i)] + 1

Count-‐Min Sketch: general case

4/7/15

3

Querying the Count-‐Min Sketch

•  Query Q(i)? –  What is in Count[j,k]?

–  Thus:

–  Return:

©Emily Fox 2015 5

8j 2 {1, . . . , p} : Count[j, hj(i)] Count[j, hj(i)] + 1

But updates may be posiTve or negaTve

•  Count-‐Min sketch for posiTve & negaTve case –  ai no longer necessarily posiTve

•  Update the same: Observe change Δi to element i:

–  Each Count[j,h(i)] no longer an upper bound on ai

•  How do we make a predicTon?

•  Bound: –  With probability at least 1-‐δ1/4, where ||a|| = Σi |ai|

©Emily Fox 2015 6

|ai � ai| 3✏||a||1

w

(t+1)i w

(t)i + ⌘t

n

��w(t)i + x

(t)i [y(t) � P (Y = 1|x(t)

,w

(t))]o

8j 2 {1, . . . , p} : Count[j, hj(i)] Count[j, hj(i)] +�i

4/7/15

4

Finally, Sketching for LR

•  Never need to know size of vocabulary! •  At every iteraTon, update Count-‐Min matrix:

•  Making a predicTon:

•  Scales to huge problems, great pracTcal implicaTons… ©Emily Fox 2015 7

w

(t+1)i w

(t)i + ⌘t

n

��w(t)i + x

(t)i [y(t) � P (Y = 1|x(t)

,w

(t))]o

Hash Kernels •  Count-‐Min sketch not designed for negaTve updates •  Biased esTmates of dot products

•  Hash Kernels: Very simple, but powerful idea to remove bias •  Pick 2 hash funcTons:

–  h : Just like in Count-‐Min hashing

–  ξ : Sign hash funcTon •  Removes the bias found in Count-‐Min hashing (see homework)

•  Define a “kernel”, a projecTon ϕ for x:

©Emily Fox 2015 8

4/7/15

5

Hash Kernels Preserve Dot Products

•  Hash kernels provide unbiased esTmate of dot-‐products!

•  Variance decreases as O(1/m)

•  Choosing m? For ε>0, if

–  Under certain condiTons… –  Then, with probability at least 1-‐δ:

©Emily Fox 2015 9

(1� ✏)||x� x

0||22 ||�(x)� �(x0)||22 (1 + ✏)||x� x

0||22

m = O log

N�

✏2

!

�i(x) =X

j:h(j)=i

⇠(j)xj

Learning With Hash Kernels •  Given hash kernel of dimension m, specified by h and ξ

–  Learn m dimensional weight vector •  Observe data point x

–  Dimension does not need to be specified a priori!

•  Compute ϕ(x): –  IniTalize ϕ(x) –  For non-‐zero entries j of xj:

•  Use normal update as if observaTon were ϕ(x), e.g., for LR using SGD:

©Emily Fox 2015 10

w(t+1)i w(t)

i + ⌘tn

��w(t)i + �i(x

(t))[y(t) � P (Y = 1|�(x(t)),w(t))]o

4/7/15

6

InteresTng ApplicaTon of Hash Kernels: MulT-‐Task Learning

•  Personalized click esTmaTon for many users: –  One global click predicTon vector w:

•  But…

–  A click predicTon vector wu per user u:

•  But…

•  MulT-‐task learning: Simultaneously solve mulTple learning related problems: –  Use informaTon from one learning problem to inform the others

•  In our simple example, learn both a global w and one wu per user: –  PredicTon for user u:

–  If we know liZle about user u:

–  Axer a lot of data from user u:

©Emily Fox 2015 11

Problems with Simple MulT-‐Task Learning

•  Dealing with new user is annoying, just like dealing with new words in vocabulary

•  Dimensionality of joint parameter space is HUGE, e.g. personalized email spam classificaTon from Weinberger et al.: –  3.2M emails –  40M unique tokens in vocabulary –  430K users –  16T parameters needed for personalized classificaTon!

©Emily Fox 2015 12

4/7/15

7

Hash Kernels for MulT-‐Task Learning •  Simple, preZy soluTon with hash kernels:

–  Very mulT-‐task learning as (sparse) learning problem with (huge) joint data point z for point x and user u:

•  EsTmaTng click probability as desired:

•  Address huge dimensionality, new words, and new users using hash kernels:

©Emily Fox 2015 13

Simple Trick for Forming ProjecTon ϕ(x,u) •  Observe data point x for user u

–  Dimension does not need to be specified a priori and user can be new!

•  Compute ϕ(x,u): –  IniTalize ϕ(x,u) –  For non-‐zero entries j of xj: •  E.g., j=‘Obamacare’ •  Need two contribuTons to ϕ:

–  Global contribuTon –  Personalized ContribuTon

•  Simply:

•  Learn as usual using ϕ(x,u) instead of ϕ(x) in update funcTon ©Emily Fox 2015 14

4/7/15

8

Results from Weinberger et al. on Spam ClassificaTon: Effect of m

©Emily Fox 2015 15

Results from Weinberger et al. on Spam ClassificaTon: MulT-‐Task Effect

©Emily Fox 2015 16

4/7/15

9

What you need to know •  Hash funcTons •  Bloom filter

–  Test membership with some false posiTves, but very small number of bits per element

•  Count-‐Min sketch –  PosiTve counts: upper bound with nice rates of convergence –  General case

•  ApplicaTon to logisTc regression •  Hash kernels:

–  Sparse representaTon for feature vectors –  Very simple, use two hash funcTon (Can use one hash funcTon…take least significant bit to define ξ) –  Quickly generate projecTon ϕ(x) –  Learn in projected space

•  MulT-‐task learning: –  Solve many related learning problems simultaneously –  Very easy to implement with hash kernels –  Significantly improve accuracy in some problems (if there is enough data from individual users)

©Emily Fox 2015 17

©Emily Fox 2015 18

Machine Learning for Big Data CSE547/STAT548, University of Washington

Emily Fox April 9th, 2015

Task DescripTon: Finding Similar Documents

Case Study 2: Document Retrieval

4/7/15

10

Document Retrieval

©Emily Fox 2015 19

n  Goal: Retrieve documents of interest n  Challenges:

¨ Tons of arTcles out there ¨ How should we measure similarity?

Task 1: Find Similar Documents

©Emily Fox 2015 20

n  To begin… ¨  Input: Query arTcle ¨ Output: Set of k similar arTcles

4/7/15

11

Document RepresentaTon

©Emily Fox 2015 21

n  Bag of words model

1-‐Nearest Neighbor

©Emily Fox 2015 22

n  ArTcles

n  Query:

n  1-‐NN ¨  Goal:

¨  FormulaTon:

4/7/15

12

k-‐Nearest Neighbor

©Emily Fox 2015 23

n  ArTcles

n  Query:

n  k-‐NN ¨  Goal:

¨  FormulaTon:

X = {x1, . . . , x

N}, x

i 2 Rd

x 2 Rd

Distance Metrics – Euclidean

©Emily Fox 2015 24 24

Other Metrics… n  Mahalanobis, Rank-‐based, CorrelaTon-‐based, cosine similarity…

where

Or, more generally,

⌃ =

2

6664

�21 0 · · · 00 �2

2 · · · 0...

... · · ·...

0 0 . . . �2d

3

7775

d(u, v) =p

(u� v)0⌃(u� v)

d(u, v) =

vuutdX

i=1

�2i (ui � vi)2

d(u, v) =

vuutdX

i=1

(ui � vi)2

Equivalently,

4/7/15

13

Notable Distance Metrics (and their level sets)

©Emily Fox 2015 25

L1 norm (absolute)

L∞ (max) norm

Scaled Euclidian (L2)

Mahalanobis (Σ is general sym pos def matrix, on previous slide = diagonal)

n  Recall distance metric

n  What if each document were Tmes longer? ¨  Scale word count vectors

¨  What happens to measure of similarity?

n  Good to normalize vectors

Euclidean Distance + Document Retrieval

©Emily Fox 2015 26 26

d(u, v) =

vuutdX

i=1

(ui � vi)2

↵

4/7/15

14

Issues with Document RepresentaTon

©Emily Fox 2015 27

n  Words counts are bad for standard similarity metrics

n  Term Frequency – Inverse Document Frequency (�-‐idf)

¨  Increase importance of rare words

TF-‐IDF

©Emily Fox 2015 28

n  Term frequency:

¨  Could also use n  Inverse document frequency:

n  �-‐idf:

¨  High for document d with high frequency of term t (high “term frequency”) and few documents containing term t in the corpus (high “inverse doc frequency”)

tf(t, d) =

{0, 1}, 1 + log f(t, d), . . .

idf(t,D) =

tfidf(t, d,D) =

4/7/15

15

n  Naïve approach: Brute force search ¨  Given a query point ¨  Scan through each point ¨  O(N) distance computaTons per

1-‐NN query! ¨  O(Nlogk) per k-‐NN query!

n  What if N is huge??? (and many queries)

Issues with Search Techniques

©Emily Fox 2015 29

33 Distance Computations

x

x

i

n  Smarter approach: kd-‐trees ¨  Structured organizaTon of documents

n  Recursively parTTons points into axis aligned boxes.

¨  Enables more efficient pruning of search space

n  Examine nearby points first. n  Ignore any points that are further than the nearest point found so far.

n  kd-‐trees work “well” in “low-‐medium” dimensions ¨  We’ll get back to this…

KD-‐Trees

©Emily Fox 2015 30

4/7/15

16

KD-‐Tree ConstrucTon

©Emily Fox 2015 31

Pt X Y

1 0.00 0.00 2 1.00 4.31 3 0.13 2.85 … … …

n  Start with a list of d-‐dimensional points.


©Emily Fox 2015 32

Pt X Y 1 0.00 0.00 3 0.13 2.85 … … …

X>.5

Pt X Y 2 1.00 4.31 … … …

YES NO

n  Split the points into 2 groups by: ¨  Choosing dimension dj and value V (methods to be discussed…)

¨  SeparaTng the points into > V and <= V. x

idj

x

idj

4/7/15

17


©Emily Fox 2015 33

X>.5

Pt X Y 2 1.00 4.31 … … …

YES NO

n  Consider each group separately and possibly split again (along same/different dimension). ¨  Stopping criterion to be discussed…

Pt X Y 1 0.00 0.00 3 0.13 2.85 … … …


©Emily Fox 2015 34

Pt X Y 3 0.13 2.85 … … …

X>.5

Pt X Y 2 1.00 4.31 … … …

YES NO

Pt X Y 1 0.00 0.00 … … …

Y>.1 NO YES

n  Consider each group separately and possibly split again (along same/different dimension). ¨  Stopping criterion to be discussed…

4/7/15

18


©Emily Fox 2015 35

n  ConTnue spli�ng points in each set ¨  creates a binary tree structure

n  Each leaf node contains a list of points


©Emily Fox 2015 36

n  Keep one addiTonal piece of informaTon at each node: ¨  The (Tght) bounds of the points at or below this node.

4/7/15

19


©Emily Fox 2015 37

n  Use heurisTcs to make spli�ng decisions: n  Which dimension do we split along?

n  Which value do we split at?

n  When do we stop?

Many heurisTcs…

©Emily Fox 2015 38

median heurisTc center-‐of-‐range heurisTc

4/7/15

20

Nearest Neighbor with KD Trees

©Emily Fox 2015 39

n  Traverse the tree looking for the nearest neighbor of the query point.


©Emily Fox 2015 40

n  Examine nearby points first: ¨  Explore branch of tree closest to the query point first.

4/7/15

21


©Emily Fox 2015 41

n  Examine nearby points first: ¨  Explore branch of tree closest to the query point first.


©Emily Fox 2015 42

n  When we reach a leaf node: ¨  Compute the distance to each point in the node.

4/7/15

22


©Emily Fox 2015 43

n  When we reach a leaf node: ¨  Compute the distance to each point in the node.


©Emily Fox 2015 44

n  Then backtrack and try the other branch at each node visited

4/7/15

23


©Emily Fox 2015 45

n  Each Tme a new closest node is found, update the distance bound


©Emily Fox 2015 46

n  Using the distance bound and bounding box of each node: ¨  Prune parts of the tree that could NOT include the nearest neighbor

4/7/15

25

Complexity

©Emily Fox 2015 49

n  For (nearly) balanced, binary trees... n  ConstrucTon

¨  Size: ¨  Depth: ¨ Median + send points lex right: ¨  ConstrucTon Tme:

n  1-‐NN query ¨  Traverse down tree to starTng point: ¨ Maximum backtrack and traverse: ¨  Complexity range:

n  Under some assumpTons on distribuTon of points, we get O(logN) but exponenTal in d (see citaTons in reading)

Complexity

©Emily Fox 2015 50

4/7/15

26

Complexity for N Queries

©Emily Fox 2015 51

n  Ask for nearest neighbor to each document

n  Brute force 1-‐NN:

n  kd-‐trees:

InspecTons vs. N and d

©Emily Fox 2015 52

0 2000 4000 6000 8000 10000

10

20

30

40

50

60

70

80

1 3 5 7 9 11 13 150

100

200

300

400

500

600

0 2000 4000 6000 8000 10000

10

20

30

40

50

60

70

80

1 3 5 7 9 11 13 150

100

200

300

400

500

600

4/7/15

27

K-‐NN with KD Trees

©Emily Fox 2015 53

n  Exactly the same algorithm, but maintain distance as distance to furthest of current k nearest neighbors

n  Complexity is:

Approximate K-‐NN with KD Trees

©Emily Fox 2015 54

n  Before: Prune when distance to bounding box > n  Now: Prune when distance to bounding box > n  Will prune more than allowed, but can guarantee that if we return a neighbor at

distance , then there is no neighbor closer than . n  In pracTce this bound is loose…Can be closer to opTmal. n  Saves lots of search Tme at liZle cost in quality of nearest neighbor.

r/↵r

4/7/15

28

Wrapping Up – Important Points

©Emily Fox 2015 55

kd-‐trees n  Tons of variants

¨  On construcTon of trees (heurisTcs for spli�ng, stopping, represenTng branches…) ¨  Other representaTonal data structures for fast NN search (e.g., ball trees,…)

Nearest Neighbor Search n  Distance metric and data representaTon are crucial to answer returned For both… n  High dimensional spaces are hard!

¨  Number of kd-‐tree searches can be exponenTal in dimension n  Rule of thumb… N >> 2d… Typically useless.

¨  Distances are sensiTve to irrelevant features n  Most dimensions are just noise à Everything equidistant (i.e., everything is far away) n  Need technique to learn what features are important for your task

What you need to know

©Emily Fox 2015 56

n  Document retrieval task ¨ Document representaTon (bag of words) ¨ �-‐idf

n  Nearest neighbor search ¨ FormulaTon ¨ Different distance metrics and sensiTvity to choice ¨ Challenges with large N

n  kd-‐trees for nearest neighbor search ¨ ConstrucTon of tree ¨ NN search algorithm using tree ¨ Complexity of construcTon and query ¨ Challenges with large d

4/7/15

29

Acknowledgment

©Emily Fox 2015 57

n  This lecture contains some material from Andrew Moore’s excellent collecTon of ML tutorials: ¨ hZp://www.cs.cmu.edu/~awm/tutorials

n  In parTcular, see: ¨ hZp://grist.caltech.edu/sc4devo/.../files/sc4devo_scalable_datamining.ppt

Tackling&an&Unknown&Number&of& Features&with&Sketching&& · 4/7/15 2...

Documents

Transcript of Tackling&an&Unknown&Number&of& Features&with&Sketching&& · 4/7/15 2...