Tackling an Unknown Number of Features with Sketching


Transcript of Tackling an Unknown Number of Features with Sketching

Page 1

©Emily  Fox  2015   1  

Case Study 1: Estimating Click Probabilities

Tackling  an  Unknown  Number  of  Features  with  Sketching    

Machine  Learning  for  Big  Data      CSE547/STAT548,  University  of  Washington  

Emily  Fox  April  9th,  2015  

Problem  1:  Complexity  of  LR  Updates  

• Logistic regression update:

•  Complexity  of  updates:  –  Constant  in  number  of  data  points    –  In  number  of  features?  

• Problem both in terms of computational complexity and sample complexity

• What can we do with very high-dimensional feature spaces? – Kernels not always appropriate, or scalable – What else?

©Emily  Fox  2015   2  

w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \{ -\lambda w_i^{(t)} + x_i^{(t)} [ y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)}) ] \}
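A minimal Python sketch of this update (illustrative names; dense NumPy weights), showing that each SGD step touches every one of the d features, so the per-update cost is O(d):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lr_sgd_step(w, x, y, eta, lam):
        # One L2-regularized logistic regression SGD step:
        # w_i <- w_i + eta * ( -lam * w_i + x_i * (y - P(Y=1|x,w)) )
        p = sigmoid(w @ x)          # P(Y = 1 | x, w)
        return w + eta * (-lam * w + x * (y - p))

For sparse x only the nonzero entries contribute to the gradient term, but the regularization term -\lambda w still touches all d weights unless handled lazily.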

Page 2

Problem  2:  Unknown  Number  of  Features  

• For example, bag-of-words features for text data: – “Mary had a little lamb, little lamb…”

• What’s the dimensionality of x? • What if we see a new word that was not in our vocabulary?

–  Obamacare  

– Theoretically, just keep going in your learning, and initialize w_Obamacare = 0 – In practice, need to re-allocate memory, fix indices,… A big problem for Big Data

©Emily  Fox  2015   3  

•  Keep  p  by  m  Count  matrix    

• p hash functions: – Just like in the Bloom filter, decrease errors with multiple hashes – Every time we see string i:

     

 ©Emily  Fox  2015   4  

\forall j \in \{1, \dots, p\}:\ \text{Count}[j, h_j(i)] \leftarrow \text{Count}[j, h_j(i)] + 1

Count-­‐Min  Sketch:  general  case  

Page 3: Tackling&an&Unknown&Number&of& Features&with&Sketching&& · 4/7/15 2 Problem&2:&Unknown&Number&of&Features& • For&example,&bagXofXwords&features&for&textdata:& – “Mary&had&aliZle&lamb,&liZle&lamb…”&

4/7/15  

3  

Querying  the  Count-­‐Min  Sketch  

• Query Q(i)?
– What is in Count[j,k]? The sum of a_i over all items i with h_j(i) = k, so each cell can only over-count.
– Thus: Count[j, h_j(i)] \ge a_i for every row j.
– Return: \hat{a}_i = \min_j Count[j, h_j(i)]

©Emily  Fox  2015   5  

\forall j \in \{1, \dots, p\}:\ \text{Count}[j, h_j(i)] \leftarrow \text{Count}[j, h_j(i)] + 1
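A small Python sketch of the structure just described: a p-by-m Count matrix with one hash function per row. The class name and the use of Python's built-in hash with row-specific seeds are illustrative stand-ins for p independent hash functions:

    import numpy as np

    class CountMin:
        """p-by-m Count matrix with one hash function per row."""
        def __init__(self, p, m, seed=0, dtype=np.int64):
            self.p, self.m = p, m
            self.count = np.zeros((p, m), dtype=dtype)
            self.seeds = [seed + j for j in range(p)]

        def _h(self, j, item):
            # Row-specific hash into {0, ..., m-1}; stand-in for p independent hashes.
            return hash((self.seeds[j], item)) % self.m

        def update(self, item, delta=1):
            # For all j in {1, ..., p}: Count[j, h_j(i)] += delta
            for j in range(self.p):
                self.count[j, self._h(j, item)] += delta

        def query(self, item):
            # With positive counts, collisions only inflate each cell,
            # so min_j Count[j, h_j(i)] is an upper bound on the true count a_i.
            return min(self.count[j, self._h(j, item)] for j in range(self.p))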

But updates may be positive or negative

• Count-Min sketch for the positive & negative case – a_i no longer necessarily positive

• Update the same: Observe change \Delta_i to element i:

– Each Count[j, h_j(i)] no longer an upper bound on a_i

• How do we make a prediction?

• Bound: – With probability at least 1 - \delta^{1/4}, where \|a\|_1 = \sum_i |a_i|

©Emily  Fox  2015   6  

|\hat{a}_i - a_i| \le 3\epsilon \|a\|_1

w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \{ -\lambda w_i^{(t)} + x_i^{(t)} [ y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)}) ] \}

\forall j \in \{1, \dots, p\}:\ \text{Count}[j, h_j(i)] \leftarrow \text{Count}[j, h_j(i)] + \Delta_i
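With signed updates, a common estimator (and the one behind the 3\epsilon\|a\|_1 bound above) is the median across the p rows rather than the min. A hedged sketch, reusing the CountMin class above:

    import numpy as np

    def query_signed(cm, item):
        # With positive and negative updates, each Count[j, h_j(i)] can over- or
        # under-shoot a_i, so take the median across the p rows instead of the min.
        # With suitable p and m, |a_hat_i - a_i| <= 3*eps*||a||_1 holds with
        # probability at least 1 - delta^(1/4).
        vals = [cm.count[j, cm._h(j, item)] for j in range(cm.p)]
        return float(np.median(vals))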

Page 4

Finally,  Sketching  for  LR  

• Never need to know the size of the vocabulary! • At every iteration, update the Count-Min matrix:

• Making a prediction:

• Scales to huge problems, great practical implications… ©Emily Fox 2015 7

w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \{ -\lambda w_i^{(t)} + x_i^{(t)} [ y^{(t)} - P(Y=1 \mid x^{(t)}, w^{(t)}) ] \}
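One hedged way to realize "update the Count-Min matrix at every iteration": store the weight vector itself in the sketch, push each SGD increment as a signed update, and read w_i back via the median estimate when predicting. This reuses CountMin and query_signed from the sketches above (with a float Count matrix); it illustrates the idea, not the only possible variant:

    import numpy as np

    # Hypothetical setup: float cells so the sketch can hold real-valued weight increments.
    cm = CountMin(p=4, m=2**20, dtype=float)

    def lr_sketch_step(cm, x, y, eta, lam):
        """One SGD step with the weights stored in a Count-Min sketch.
        x is a sparse feature map {feature_name: value}; vocabulary size is never needed."""
        # Predict with the sketched weights (median estimates).
        z = sum(query_signed(cm, f) * v for f, v in x.items())
        p = 1.0 / (1.0 + np.exp(-z))
        # Push the (possibly negative) SGD increment for each active feature into the sketch.
        for f, v in x.items():
            w_f = query_signed(cm, f)                       # current estimate of w_f
            cm.update(f, eta * (-lam * w_f + v * (y - p)))  # signed Count-Min update
        # Regularization is applied lazily (active features only) in this sketch.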

Hash Kernels • Count-Min sketch not designed for negative updates • Biased estimates of dot products

• Hash Kernels: Very simple, but powerful idea to remove bias • Pick 2 hash functions:

–  h  :    Just  like  in  Count-­‐Min  hashing  

– ξ : Sign hash function • Removes the bias found in Count-Min hashing (see homework)

• Define a “kernel”, a projection ϕ for x:

©Emily  Fox  2015   8  

Page 5

Hash  Kernels  Preserve  Dot  Products  

• Hash kernels provide unbiased estimates of dot products!

•  Variance  decreases  as  O(1/m)  

•  Choosing  m?    For  ε>0,  if  

– Under certain conditions… – Then, with probability at least 1 - \delta:

©Emily  Fox  2015   9  

(1 - \epsilon)\|x - x'\|_2^2 \le \|\phi(x) - \phi(x')\|_2^2 \le (1 + \epsilon)\|x - x'\|_2^2

m = O\left( \frac{\log (N/\delta)}{\epsilon^2} \right)

\phi_i(x) = \sum_{j : h(j) = i} \xi(j)\, x_j

Learning  With  Hash  Kernels  •  Given  hash  kernel  of  dimension  m,  specified  by  h  and  ξ  

–  Learn  m  dimensional  weight  vector  •  Observe  data  point  x    

–  Dimension  does  not  need  to  be  specified  a  priori!  

• Compute ϕ(x): – Initialize ϕ(x) = 0 – For non-zero entries j of x:

• Use normal update as if observation were ϕ(x), e.g., for LR using SGD:

©Emily  Fox  2015   10  

w_i^{(t+1)} \leftarrow w_i^{(t)} + \eta_t \{ -\lambda w_i^{(t)} + \phi_i(x^{(t)}) [ y^{(t)} - P(Y=1 \mid \phi(x^{(t)}), w^{(t)}) ] \}
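A hedged Python sketch of the projection ϕ and of SGD in the hashed m-dimensional space; here h and ξ are stand-ins built from Python's hash (the analysis assumes properly independent hash functions), and all names are illustrative:

    import numpy as np

    def h(j, m, seed=0):
        return hash((seed, j)) % m                   # bucket hash, as in Count-Min

    def xi(j, seed=1):
        return 1 if hash((seed, j)) % 2 == 0 else -1 # sign hash; removes the bias

    def phi(x, m):
        # phi_i(x) = sum over j with h(j) = i of xi(j) * x_j, for sparse x = {feature: value}
        out = np.zeros(m)
        for j, v in x.items():
            out[h(j, m)] += xi(j) * v
        return out

    def lr_hashed_sgd_step(w, x, y, eta, lam, m):
        # w is the m-dimensional weight vector learned in the hashed space.
        z = phi(x, m)
        p = 1.0 / (1.0 + np.exp(-w @ z))             # P(Y = 1 | phi(x), w)
        return w + eta * (-lam * w + z * (y - p))

Note that the dimension of x never needs to be specified: a brand-new feature such as 'Obamacare' simply hashes to some bucket in {0, …, m-1}.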

Page 6

Interesting Application of Hash Kernels: Multi-Task Learning

• Personalized click estimation for many users: – One global click prediction vector w:

•  But…  

– A click prediction vector w_u per user u:

•  But…  

• Multi-task learning: Simultaneously solve multiple related learning problems: – Use information from one learning problem to inform the others

• In our simple example, learn both a global w and one w_u per user: – Prediction for user u:

– If we know little about user u:

– After a lot of data from user u:

©Emily  Fox  2015   11  

Problems with Simple Multi-Task Learning

•  Dealing  with  new  user  is  annoying,  just  like  dealing  with  new  words  in  vocabulary  

• Dimensionality of joint parameter space is HUGE, e.g. personalized email spam classification from Weinberger et al.: – 3.2M emails – 40M unique tokens in vocabulary – 430K users – 16T parameters needed for personalized classification!

©Emily  Fox  2015   12  

Page 7

Hash Kernels for Multi-Task Learning • Simple, pretty solution with hash kernels:

– View multi-task learning as a (sparse) learning problem with a (huge) joint data point z for point x and user u:

• Estimating click probability as desired:

•  Address  huge  dimensionality,  new  words,  and  new  users  using  hash  kernels:    

©Emily  Fox  2015   13  

Simple Trick for Forming Projection ϕ(x,u) • Observe data point x for user u

–  Dimension  does  not  need  to  be  specified  a  priori  and  user  can  be  new!  

• Compute ϕ(x,u): – Initialize ϕ(x,u) = 0 – For non-zero entries j of x: • E.g., j = ‘Obamacare’ • Need two contributions to ϕ:

– Global contribution – Personalized contribution

•  Simply:  

 

• Learn as usual using ϕ(x,u) instead of ϕ(x) in the update function ©Emily Fox 2015 14
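A hedged sketch of the two-contribution trick, reusing h and ξ from the earlier sketch: each active feature j is hashed once on its own (global) and once tagged with the user id u (personalized):

    import numpy as np

    def phi_multitask(x, u, m):
        # phi(x, u): every active feature j contributes twice --
        # once globally and once tagged with the user id u.
        out = np.zeros(m)
        for j, v in x.items():
            out[h(j, m)] += xi(j) * v                # global contribution
            out[h((u, j), m)] += xi((u, j)) * v      # personalized contribution
        return out

Learning then proceeds exactly as before, with ϕ(x,u) in place of ϕ(x); a new user simply produces new hashed buckets.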

Page 8

Results from Weinberger et al. on Spam Classification: Effect of m

©Emily  Fox  2015   15  

Results from Weinberger et al. on Spam Classification: Multi-Task Effect

©Emily  Fox  2015   16  

Page 9

What you need to know • Hash functions • Bloom filter

– Test membership with some false positives, but very small number of bits per element

• Count-Min sketch – Positive counts: upper bound with nice rates of convergence – General case

• Application to logistic regression • Hash kernels:

– Sparse representation for feature vectors – Very simple, use two hash functions (can use one hash function… take its least significant bit to define ξ) – Quickly generate projection ϕ(x) – Learn in projected space

• Multi-task learning: – Solve many related learning problems simultaneously – Very easy to implement with hash kernels – Significantly improve accuracy in some problems (if there is enough data from individual users)

©Emily  Fox  2015   17  

©Emily  Fox  2015   18  

Machine  Learning  for  Big  Data      CSE547/STAT548,  University  of  Washington  

Emily  Fox  April  9th,  2015  

Task Description: Finding Similar Documents

Case  Study  2:  Document  Retrieval  

Page 10

Document  Retrieval  

©Emily  Fox  2015   19  

n  Goal:  Retrieve  documents  of  interest    n  Challenges:    

¨ Tons of articles out there ¨ How should we measure similarity?

Task  1:  Find  Similar  Documents  

©Emily  Fox  2015   20  

n To begin… ¨ Input: Query article ¨ Output: Set of k similar articles

Page 11

Document Representation

©Emily  Fox  2015   21  

n  Bag  of  words  model  

1-­‐Nearest  Neighbor  

©Emily  Fox  2015   22  

n Articles

n  Query:    

n  1-­‐NN  ¨  Goal:    

¨ Formulation:

Page 12

k-­‐Nearest  Neighbor  

©Emily  Fox  2015   23  

n Articles

n  Query:    

n  k-­‐NN  ¨  Goal:    

¨ Formulation:

X = \{x_1, \dots, x_N\},\ x_i \in \mathbb{R}^d
x \in \mathbb{R}^d

Distance  Metrics  –  Euclidean    

©Emily  Fox  2015   24  24

n Euclidean distance:
   d(u, v) = \sqrt{ \sum_{i=1}^d (u_i - v_i)^2 }

n Or, more generally, a scaled (weighted) Euclidean distance:
   d(u, v) = \sqrt{ \sum_{i=1}^d \sigma_i^2 (u_i - v_i)^2 }

n Equivalently,
   d(u, v) = \sqrt{ (u - v)' \Sigma (u - v) },  where  \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \dots, \sigma_d^2)

Other Metrics… n Mahalanobis, Rank-based, Correlation-based, cosine similarity…
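A small NumPy sketch of these metrics; sigma2 denotes the vector of per-dimension factors \sigma_i^2 and Sigma the corresponding matrix (illustrative names):

    import numpy as np

    def euclidean(u, v):
        return np.sqrt(np.sum((u - v) ** 2))

    def scaled_euclidean(u, v, sigma2):
        # sigma2 is the vector (sigma_1^2, ..., sigma_d^2) of per-dimension weights.
        return np.sqrt(np.sum(sigma2 * (u - v) ** 2))

    def general_quadratic(u, v, Sigma):
        # d(u, v) = sqrt((u - v)' Sigma (u - v)); with Sigma diagonal this reduces
        # to scaled_euclidean, and Sigma = I gives plain Euclidean distance.
        d = u - v
        return np.sqrt(d @ Sigma @ d)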

Page 13

Notable  Distance  Metrics    (and  their  level  sets)  

©Emily  Fox  2015   25  

L1  norm  (absolute)  

L∞  (max)  norm  

Scaled Euclidean (L2)

Mahalanobis (Σ is a general symmetric positive definite matrix; on the previous slide it was diagonal)

n  Recall  distance  metric    

n What if each document were some factor times longer? ¨ Scale word count vectors

¨  What  happens  to  measure  of  similarity?    

n  Good  to  normalize  vectors  

Euclidean  Distance  +  Document  Retrieval  

©Emily  Fox  2015   26  26

d(u, v) = \sqrt{ \sum_{i=1}^d (u_i - v_i)^2 }

Page 14

Issues with Document Representation

©Emily  Fox  2015   27  

n Word counts are bad for standard similarity metrics

n Term Frequency – Inverse Document Frequency (tf-idf)

¨  Increase  importance  of  rare  words  

TF-­‐IDF  

©Emily  Fox  2015   28  

n  Term  frequency:  

¨  Could  also  use    n  Inverse  document  frequency:  

n tf-idf:

¨  High  for  document  d  with  high  frequency  of  term  t  (high  “term  frequency”)  and  few  documents  containing  term  t  in  the  corpus  (high  “inverse  doc  frequency”)  

tf(t, d) = f(t, d)   (alternatives: \{0, 1\},\ 1 + \log f(t, d), \dots)

idf(t, D) = \log\left( |D| \,/\, |\{ d \in D : t \in d \}| \right)

tfidf(t, d, D) = tf(t, d) \cdot idf(t, D)
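A minimal sketch using the standard definitions assumed above (raw counts for tf, log inverse document frequency for idf); the function and variable names are illustrative:

    import math
    from collections import Counter

    def tf(t, d):
        # Raw count of term t in document d (d given as a list of tokens);
        # {0,1} or 1 + log f(t,d) are common alternatives.
        return Counter(d)[t]

    def idf(t, D):
        # log of (number of documents) / (number of documents containing t);
        # assumes t appears in at least one document of the corpus D.
        n_containing = sum(1 for d in D if t in d)
        return math.log(len(D) / n_containing)

    def tfidf(t, d, D):
        return tf(t, d) * idf(t, D)

    D = [["mary", "had", "a", "little", "lamb"], ["the", "lamb", "was", "white"]]
    print(tfidf("lamb", D[0], D))   # "lamb" is in every document, so idf = 0 and tf-idf = 0
    print(tfidf("mary", D[0], D))   # rarer word gets a positive score (1 * log 2)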

Page 15

n Naïve approach: Brute force search
¨ Given a query point x
¨ Scan through each point x_i
¨ O(N) distance computations per 1-NN query!
¨ O(N log k) per k-NN query!

   

n  What  if  N  is  huge???  (and  many  queries)  

 

Issues  with  Search  Techniques  

©Emily  Fox  2015   29  

[Figure: brute-force search example — 33 distance computations]

n Smarter approach: kd-trees ¨ Structured organization of documents

n Recursively partitions points into axis-aligned boxes.

¨  Enables  more  efficient  pruning  of  search  space  

n  Examine  nearby  points  first.  n  Ignore  any  points  that  are  further  than  the  nearest  point  found  so  far.  

n  kd-­‐trees  work  “well”  in  “low-­‐medium”  dimensions  ¨  We’ll  get  back  to  this…  

KD-­‐Trees  

©Emily  Fox  2015   30  

Page 16

KD-Tree Construction

©Emily  Fox  2015   31  

Pt   X     Y
1    0.00  0.00
2    1.00  4.31
3    0.13  2.85
…    …     …

n  Start  with  a  list  of  d-­‐dimensional  points.  

KD-Tree Construction

©Emily  Fox  2015   32  

Split at X > 0.5?

  NO (X ≤ 0.5):          YES (X > 0.5):
  Pt   X     Y            Pt   X     Y
  1    0.00  0.00         2    1.00  4.31
  3    0.13  2.85         …    …     …
  …    …     …

n Split the points into 2 groups by:
¨ Choosing dimension d_j and value V (methods to be discussed…)
¨ Separating the points into x_i^{d_j} > V and x_i^{d_j} ≤ V.

Page 17

KD-Tree Construction

©Emily  Fox  2015   33  

Split at X > 0.5?

  NO (X ≤ 0.5):          YES (X > 0.5):
  Pt   X     Y            Pt   X     Y
  1    0.00  0.00         2    1.00  4.31
  3    0.13  2.85         …    …     …
  …    …     …

n Consider each group separately and possibly split again (along same/different dimension).
¨ Stopping criterion to be discussed…

KD-Tree Construction

©Emily  Fox  2015   34  

Split at X > 0.5?  The NO side is then split again at Y > 0.1:

  YES (X > 0.5):         NO (X ≤ 0.5), then Y > 0.1?
  Pt   X     Y             YES (Y > 0.1):        NO (Y ≤ 0.1):
  2    1.00  4.31          Pt   X     Y           Pt   X     Y
  …    …     …             3    0.13  2.85        1    0.00  0.00
                           …    …     …           …    …     …

n Consider each group separately and possibly split again (along same/different dimension).
¨ Stopping criterion to be discussed…

Page 18

KD-Tree Construction

©Emily  Fox  2015   35  

n Continue splitting points in each set ¨ creates a binary tree structure

n  Each  leaf  node  contains  a  list  of  points  

KD-Tree Construction

©Emily  Fox  2015   36  

n Keep one additional piece of information at each node: ¨ The (tight) bounds of the points at or below this node.
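A hedged sketch of the construction just described: recursively split along a chosen dimension at a chosen value (here, the median of the widest dimension — one of many possible heuristics), keep points at the leaves, and store the tight bounding box at every node:

    import numpy as np

    class KDNode:
        def __init__(self, points, leaf_size=2):
            # points: an (n, d) array of d-dimensional points.
            self.points = points
            self.bounds = (points.min(axis=0), points.max(axis=0))   # tight bounding box
            self.left = self.right = None
            if len(points) > leaf_size:
                # One possible heuristic: split the widest dimension at its median.
                self.dim = int(np.argmax(self.bounds[1] - self.bounds[0]))
                self.val = float(np.median(points[:, self.dim]))
                mask = points[:, self.dim] <= self.val
                if mask.any() and (~mask).any():                      # stop on degenerate splits
                    self.left = KDNode(points[mask], leaf_size)
                    self.right = KDNode(points[~mask], leaf_size)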

Page 19

KD-Tree Construction

©Emily  Fox  2015   37  

n Use heuristics to make splitting decisions: n Which dimension do we split along?

n  Which  value  do  we  split  at?      

n  When  do  we  stop?        

Many heuristics…

©Emily  Fox  2015   38  

median heuristic   center-of-range heuristic

Page 20

Nearest  Neighbor  with  KD  Trees  

©Emily  Fox  2015   39  

n  Traverse  the  tree  looking  for  the  nearest  neighbor  of  the  query  point.  

Nearest  Neighbor  with  KD  Trees  

©Emily  Fox  2015   40  

n  Examine  nearby  points  first:    ¨  Explore  branch  of  tree  closest  to  the  query  point  first.  

Page 21

Nearest  Neighbor  with  KD  Trees  

©Emily  Fox  2015   41  

n  Examine  nearby  points  first:    ¨  Explore  branch  of  tree  closest  to  the  query  point  first.  

Nearest  Neighbor  with  KD  Trees  

©Emily  Fox  2015   42  

n  When  we  reach  a  leaf  node:    ¨  Compute  the  distance  to  each  point  in  the  node.  

Page 22

Nearest  Neighbor  with  KD  Trees  

©Emily  Fox  2015   43  

n  When  we  reach  a  leaf  node:    ¨  Compute  the  distance  to  each  point  in  the  node.  

Nearest  Neighbor  with  KD  Trees  

©Emily  Fox  2015   44  

n  Then  backtrack  and  try  the  other  branch  at  each  node  visited  

Page 23

Nearest  Neighbor  with  KD  Trees  

©Emily  Fox  2015   45  

n Each time a new closest node is found, update the distance bound

Nearest  Neighbor  with  KD  Trees  

©Emily  Fox  2015   46  

n  Using  the  distance  bound  and  bounding  box  of  each  node:  ¨  Prune  parts  of  the  tree  that  could  NOT  include  the  nearest  neighbor  

Page 24

Nearest  Neighbor  with  KD  Trees  

©Emily  Fox  2015   47  

n  Using  the  distance  bound  and  bounding  box  of  each  node:  ¨  Prune  parts  of  the  tree  that  could  NOT  include  the  nearest  neighbor  

Nearest  Neighbor  with  KD  Trees  

©Emily  Fox  2015   48  

n  Using  the  distance  bound  and  bounding  box  of  each  node:  ¨  Prune  parts  of  the  tree  that  could  NOT  include  the  nearest  neighbor  
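A hedged sketch of the search described over the last several slides, using the KDNode sketch from earlier: descend into the closer child first, compute distances at the leaves, then backtrack and prune any node whose bounding box is already farther than the current best:

    import numpy as np

    def dist_to_box(q, bounds):
        lo, hi = bounds
        return float(np.linalg.norm(np.clip(q, lo, hi) - q))   # 0 if q lies inside the box

    def nn_search(node, q, best=(np.inf, None)):
        # Prune: this subtree cannot contain anything closer than the current best.
        if node is None or dist_to_box(q, node.bounds) >= best[0]:
            return best
        if node.left is None:                                   # leaf: check each point
            for p in node.points:
                d = float(np.linalg.norm(p - q))
                if d < best[0]:
                    best = (d, p)
            return best
        # Explore the branch closer to the query first, then backtrack into the other.
        near, far = (node.left, node.right) if q[node.dim] <= node.val else (node.right, node.left)
        best = nn_search(near, q, best)
        return nn_search(far, q, best)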

Page 25

Complexity  

©Emily  Fox  2015   49  

n For (nearly) balanced, binary trees... n Construction

¨ Size: ¨ Depth: ¨ Median + send points left/right: ¨ Construction time:

n 1-NN query ¨ Traverse down tree to starting point: ¨ Maximum backtrack and traverse: ¨ Complexity range:

n Under some assumptions on distribution of points, we get O(log N) but exponential in d (see citations in reading)

Complexity  

©Emily  Fox  2015   50  

Page 26

Complexity  for  N  Queries  

©Emily  Fox  2015   51  

n  Ask  for  nearest  neighbor  to  each  document  

n  Brute  force  1-­‐NN:  

n  kd-­‐trees:  

Inspections vs. N and d

©Emily  Fox  2015   52  

[Figures: number of inspections vs. N (0 to 10,000) and vs. d (1 to 15)]

Page 27

K-­‐NN  with  KD  Trees  

©Emily  Fox  2015   53  

n  Exactly  the  same  algorithm,  but  maintain  distance  as  distance  to  furthest  of  current  k  nearest  neighbors  

n  Complexity  is:  

Approximate  K-­‐NN  with  KD  Trees  

©Emily  Fox  2015   54  

n Before: Prune when distance to bounding box > r
n Now: Prune when distance to bounding box > r/α
n Will prune more than allowed, but can guarantee that if we return a neighbor at distance r, then there is no neighbor closer than r/α.
n In practice this bound is loose… Can be closer to optimal.
n Saves lots of search time at little cost in quality of nearest neighbor.
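The α-approximate variant changes only the pruning test in the nn_search sketch above: prune once the distance to a node's bounding box exceeds (current best)/α, so a returned neighbor at distance r has no true neighbor closer than r/α. A hedged sketch, reusing dist_to_box and KDNode from earlier:

    import numpy as np

    def ann_search(node, q, alpha, best=(np.inf, None)):
        # Same as nn_search, but prune a box once its distance exceeds best / alpha (alpha >= 1);
        # a returned neighbor at distance r then has no true neighbor closer than r / alpha.
        if node is None or dist_to_box(q, node.bounds) >= best[0] / alpha:
            return best
        if node.left is None:
            for p in node.points:
                d = float(np.linalg.norm(p - q))
                if d < best[0]:
                    best = (d, p)
            return best
        near, far = (node.left, node.right) if q[node.dim] <= node.val else (node.right, node.left)
        best = ann_search(near, q, alpha, best)
        return ann_search(far, q, alpha, best)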

Page 28

Wrapping  Up  –  Important  Points  

©Emily  Fox  2015   55  

kd-­‐trees  n  Tons  of  variants  

¨ On construction of trees (heuristics for splitting, stopping, representing branches…) ¨ Other representational data structures for fast NN search (e.g., ball trees,…)

Nearest Neighbor Search n Distance metric and data representation are crucial to the answer returned. For both… n High-dimensional spaces are hard!

¨ Number of kd-tree searches can be exponential in dimension n Rule of thumb: need N ≫ 2^d… otherwise typically useless.

¨ Distances are sensitive to irrelevant features n Most dimensions are just noise → everything equidistant (i.e., everything is far away) n Need a technique to learn what features are important for your task

What  you  need  to  know  

©Emily  Fox  2015   56  

n Document retrieval task ¨ Document representation (bag of words) ¨ tf-idf

n Nearest neighbor search ¨ Formulation ¨ Different distance metrics and sensitivity to choice ¨ Challenges with large N

n kd-trees for nearest neighbor search ¨ Construction of tree ¨ NN search algorithm using tree ¨ Complexity of construction and query ¨ Challenges with large d

Page 29

Acknowledgment  

©Emily  Fox  2015   57  

n This lecture contains some material from Andrew Moore’s excellent collection of ML tutorials: ¨ http://www.cs.cmu.edu/~awm/tutorials

n In particular, see: ¨ http://grist.caltech.edu/sc4devo/.../files/sc4devo_scalable_datamining.ppt