Feedback: Clustering and k-Means - Marina Santini


Question: If we want to find a market segment to sell children's clothes, why don't we simply look at the records where the number of children is > 0 and target those people? Or we might find the intersection of marital status and children and target those people?

You can do that, of course: that's traditional programming. If you have a database, you can simply write an SQL query and extract all the relevant records that meet your criteria.
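For example, the traditional route might look like this (a minimal sketch in Java/JDBC; the connection string, table name and column names are hypothetical):

import java.sql.*;

public class TraditionalQuery {
    public static void main(String[] args) throws SQLException {
        // hypothetical SQLite database; the point is the hard-coded selection criteria
        try (Connection c = DriverManager.getConnection("jdbc:sqlite:customers.db");
             Statement s = c.createStatement();
             ResultSet r = s.executeQuery(
                 "SELECT * FROM customers WHERE children > 0")) {
            while (r.next()) {
                System.out.println(r.getString("income") + " " + r.getString("age"));
            }
        }
    }
}

The criteria are fixed in advance: the query can only return what you already thought of asking for.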

But this feasibility does not necessarily imply efficacy. You might have hidden patterns in your data that you overlook using this traditional approach.

Remember what we said at the beginning of the course (Lecture 1):


Supervised and unsupervised machine learning algorithms can both be used for classification and for data exploration.



Preliminaries  

→ k-Means & nominal/numeric attributes

Generally speaking, k-Means is often implemented for numeric attributes. If you have nominal attributes in the dataset, it is common practice to binarize them.

But … apparently Weka's implementation can handle nominal data. When computing the centroids, Weka uses the mean (numeric attributes) or the mode (nominal attributes). [Math refresher: the mode is a measure of central tendency; the mode is the value that appears most often.]
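As a toy illustration (plain Java, not Weka's actual code; the attribute values are made up), computing a centroid value for one numeric and one nominal attribute might look like this:

import java.util.HashMap;
import java.util.Map;

public class CentroidToy {
    public static void main(String[] args) {
        // numeric attribute: the centroid value is the mean
        double[] income = {20000, 25000, 30000};
        double sum = 0;
        for (double v : income) sum += v;
        System.out.println("mean income = " + sum / income.length);

        // nominal attribute: the centroid value is the mode (most frequent value)
        String[] marital = {"married", "single", "married"};
        Map<String, Integer> counts = new HashMap<>();
        for (String v : marital) counts.merge(v, 1, Integer::sum);
        String mode = null;
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (mode == null || e.getValue() > counts.get(mode)) mode = e.getKey();
        System.out.println("mode marital_status = " + mode);
    }
}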

In real life, convert nominal data into binary, because nominal attributes might cause weird results [1]. But for this exercise we trust what Weka does.
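If you want to binarize the data yourself, Weka ships a filter for it (a sketch; the file name customers.arff is an assumption):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;

public class Binarize {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("customers.arff"); // hypothetical path
        NominalToBinary ntb = new NominalToBinary();
        ntb.setInputFormat(data);                 // must be set before filtering
        Instances binarized = Filter.useFilter(data, ntb);
        System.out.println(binarized.toSummaryString());
    }
}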

NB: always check how Weka implements an ML algorithm: hover over the name of the algorithm in the Classify tab or in the Cluster tab.

k-Means Review

The most commonly used clustering strategy is based on the squared-error criterion. Objective: to minimize the squared error, where the squared error is the sum of the squared Euclidean distances (or of any other distance you decide) between each instance/observation and its cluster center. The sum of squared errors (SSE [2]) indicates how compact a cluster is: the lower the value, the better [3]. Conversely, the larger the inter-cluster distance, the better. In short, the smaller the intra-cluster distance and the larger the inter-cluster distance, the better the clustering will be.
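In symbols (a standard textbook formulation, assuming the Euclidean distance):

$$\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where k is the number of clusters, C_j is the j-th cluster and \mu_j is its centroid (the mean of the instances assigned to C_j).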

[1] See Witten et al. (2011: 480-481) and also http://stackoverflow.com/questions/28396974/weka-simple-k-means-handling-nominal-attributes
[2] See also <https://hlab.stanford.edu/brian/error_sum_of_squares.html>. Please read this very interesting thread in the Weka mailing list: some of the answers describe approaches to minimizing the within-cluster sum of squared errors. Remember that there are many empirical ways (or heuristics) to make sense of the data you have: <http://weka.8497.n7.nabble.com/Ignore-the-class-td33195.html>.
[3] See also <http://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/1999/clustering/node17.html>. Remember: when attributes are not on comparable scales, you might want to apply the normalize or the standardize filters (depending on what is more appropriate for your data). When using the Euclidean distance, if you do not normalize/standardize your data, the variables measured in large-valued units will dominate the computed dissimilarity, and variables measured in small-valued units will contribute very little.


While in hierarchical clustering an explicit measure of inter-cluster distance is provided (the linkage type), in the Weka implementation of k-Means the sum of squared errors is given, which measures the distance to the centroid. As the objective is to minimize this error, it can be considered equivalent to the intra-cluster distance.

The algorithm assesses each instance/observation, moving it into the nearest cluster. The nearest cluster is the one with the smallest Euclidean distance (or any other distance metric you decide) between the observation and the centroid of the cluster. If you have nominal data, Weka will compute the mode (not the mean), but other implementations might not handle nominal data…

Centroids: k-Means procedures work best when you provide good initial points for the clusters (it is hard to know what a good initial point is; there are several techniques to find one; in Weka the initial points are determined by the random seed parameter (default 10)).

The order of the instances in the dataset also has an impact on the final clustering solution (do you remember that the perceptron had a similar problem?). Weka provides the option preserveInstancesOrder (default value: False).

When a cluster changes (i.e. it loses or gains an instance/observation), the cluster centroids are recalculated.

This process repeats until no more instances/observations can be moved into a different cluster. At this point, all observations are in their nearest cluster according to the previous criterion.
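To make the loop concrete, here is a minimal plain-Java sketch of the k-Means iteration for numeric data (a didactic toy with made-up data, not Weka's implementation; Weka picks the initial centroids via the random seed):

public class KMeansToy {
    public static void main(String[] args) {
        double[][] X = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}, {8.5, 9.5}};
        int k = 2;
        double[][] centroids = {X[0], X[2]};   // naive initialization
        int[] assign = new int[X.length];
        boolean moved = true;
        while (moved) {                        // stop when no instance changes cluster
            moved = false;
            // assignment step: move each instance to the nearest centroid
            for (int i = 0; i < X.length; i++) {
                int best = 0;
                double bestD = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;              // squared Euclidean distance
                    for (int j = 0; j < X[i].length; j++)
                        d += (X[i][j] - centroids[c][j]) * (X[i][j] - centroids[c][j]);
                    if (d < bestD) { bestD = d; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; moved = true; }
            }
            // update step: recompute each centroid as the mean of its members
            centroids = new double[k][X[0].length];
            int[] size = new int[k];
            for (int i = 0; i < X.length; i++) {
                size[assign[i]]++;
                for (int j = 0; j < X[i].length; j++)
                    centroids[assign[i]][j] += X[i][j];
            }
            for (int c = 0; c < k; c++)
                if (size[c] > 0)
                    for (int j = 0; j < centroids[c].length; j++)
                        centroids[c][j] /= size[c];
        }
        for (int i = 0; i < X.length; i++)
            System.out.println("instance " + i + " -> cluster " + assign[i]);
    }
}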

Difference with hierarchical clustering: clusters in hierarchical clustering cannot change. With k-Means, on the contrary, it is possible for two observations to be split into separate clusters after they have been joined together.

When all the instances/observations have been assigned, the final sum of squared errors is computed…


The customer dataset

Mapping clusters to classes DOES make sense also in an exploratory/descriptive scenario.


Cluster mode: Use training set

=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     customers
Instances:    9
Attributes:   5
              income
              age
              children
              marital_status
              education
Test mode:    evaluate on training data
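The same run can be reproduced with Weka's Java API (a sketch; the file name customers.arff is an assumption, and the options mirror the command line above):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CustomersKMeans {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("customers.arff"); // hypothetical path
        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(5);      // -N 5
        km.setMaxIterations(500);  // -I 500
        km.setSeed(10);            // -S 10 (the default seed)
        km.buildClusterer(data);   // Euclidean distance is the default (-A)
        System.out.println("Within-cluster SSE: " + km.getSquaredError());

        // "Use training set" mode: evaluate on the data the clusterer was built on
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}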


According to the position of the attributes in the dataset, we could say that education would be the default class of this dataset if we were doing supervised classification. But in this run both education and marital status are simply taken as attributes.

Remember: in the USA, College refers to undergraduate studies. If you want to go on for a Master's or a Doctorate, you go to graduate school.

(Digression: watch out: what is said in Ian Witten's tutorial about the class of the iris dataset can be a bit misleading here: he suggests ignoring the class or using the class to evaluate the clusters. You can do this when you are doing "classification": he wants to classify the iris dataset using an unsupervised learning algorithm. He wants to put the iris flowers into 3 separate clusters that make sense. Here, with the customer dataset, we want to explore possible groupings that make sense to focus our marketing campaign. We are doing exploration and not classification. We consider all the attributes equally important and we do not remove any of them. --- end of digression.)


So, for the customer dataset we take all the attributes into account for an initial exploration of our problem. We know that we want 5 clusters, and the algorithm computes the cluster assignment based on this request and on the default random seed.

We do not know how compact our clusters are. SSE: 5.1; iterations: 3 --- hmm… let's try something else. Not too bad, but I am not convinced about cluster 3: its income is unconvincing.

If we change the random seed to 100, we get: sum of squared errors: 2.6 and iterations: 2.

If we change the random seed to 1000, we get: sum of squared errors: 3.0 and iterations: 2.

We make several tries because we are EXPLORING the best cluster solutions based on our distance metric, and we settle on the run with random seed 100: sum of squared errors: 2.6 and iterations: 2.
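This kind of seed exploration is easy to script (a sketch; same assumed customers.arff as above):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SeedExploration {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("customers.arff"); // hypothetical path
        for (int seed : new int[]{10, 100, 1000}) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(5);
            km.setSeed(seed);  // different seeds -> different initial centroids
            km.buildClusterer(data);
            // keep the seed with the smallest within-cluster sum of squared errors
            System.out.println("seed " + seed + ": SSE = " + km.getSquaredError());
        }
    }
}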


We look at the clusters and we see that in cluster 0 we have one instance of a customer who has an income of 200,000 dollars, an average age of 45, and 5 children, is married and attended graduate school. This cluster contains rich, highly educated people with a high number of children, so it could potentially be of interest for our campaign: they have money and lots of children, and they might want to buy quality/expensive clothes.


You can also notice that cluster 1 has a decent income; they are married, but have no children, and they are quite old.

Cluster 3 is made of single people with children, and very low income.

And so on.

We have unveiled some patterns that are plausible, and we can guess some buying behaviours: for example, we might aim a selling campaign at well-off people with expensive outdoor children's clothes (cluster 0), and we can focus another campaign on low-income single parents, who need to buy budget clothes (cluster 3).

All the clusters, except cluster 1, have a different "mean" number of children associated with a different income, marital status and education. This can have an impact on buying behaviour.

This pattern, if we think it is reliable, allows us to run a fine-grained selling campaign rather than a one-size-fits-all campaign, which can be less rewarding or less effective. Even more! Although our stated purpose is "We will target the advertising only to the persons with young children", creatively we might target another selling campaign at people who do not have children and are on the verge of retirement, and stimulate them to contribute to charity: "dress the undressed: buy a t-shirt for a distressed child in underdeveloped countries" (I am just inventing).

What is important here is to see that you have discovered patterns that can be useful to create a fine-grained and customized selling campaign, patterns that you would not have discovered just by targeting people with children and/or married people. We get a model that might be useful to make sense of unseen/future data. We learn from what we have: a model is created that generalizes over the actual data and is (hopefully) capable of making sense of data that we have not seen yet. So it is a way to have an eye into the future. This cannot be done with traditional programming or static modelling like the classical database approach, where you extract with an SQL query only the records that meet some requirements, such as having children or being married. Such a solution is possible, and it might be appropriate in certain cases, but it is very short-sighted and might not give the selling results you expect.

We do not know whether these clusters make sense in real life, for our purpose. It depends on our knowledge of the field, on our intuition, etc.

What we could do, since we have the possibility, is to use the nominal attributes to get further insights. Weka allows us to use the attributes as classes to evaluate against. We can do this as part of our unsupervised exploratory analysis of the data.

Task 3 asked you to use "marital status" to evaluate the clusters. Why can this be interesting? What does it tell us?


The marital status attribute is not taken into account when building the model.

A small sum of squared errors: number of iterations: 2; within-cluster sum of squared errors: 1.3 (a different value from the previous runs).

We get a slightly different pattern (but cluster 0 is confirmed).

How is the class assigned to the clusters? Let's learn how to read the output.

Classes to clusters evaluation. In this mode, Weka first ignores the class attribute and generates the clustering. Then, during the test phase, it assigns classes to the clusters, based on the majority value of the class attribute within each cluster. Then it computes the classification error based on this assignment, and also shows the corresponding confusion matrix.
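Programmatically, the same classes-to-clusters evaluation follows the pattern documented for Weka's API (a sketch; the file name is again an assumption):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClassesToClusters {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("customers.arff");            // hypothetical path
        data.setClassIndex(data.attribute("marital_status").index());  // the "class" to map to

        // the clusterer must not see the class attribute
        Remove rm = new Remove();
        rm.setAttributeIndices("" + (data.classIndex() + 1));  // Remove uses 1-based indices
        rm.setInputFormat(data);
        Instances dataNoClass = Filter.useFilter(data, rm);

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(5);
        km.buildClusterer(dataNoClass);

        // the evaluation sees the class: each cluster gets the majority class value,
        // and the classification error and confusion matrix are reported
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(km);
        eval.evaluateClusterer(data);
        System.out.println(eval.clusterResultsToString());
    }
}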

What we can understand from this clustering solution is that those who are married (cluster 4) tend to have fewer children and a higher income than those who are divorced (cluster 2) or single (cluster 3). Conversely, those who are divorced or single have quite many children but a lower income. All the members of these 3 clusters are relatively young (between 25 and 35), etc.

You can see in the confusion matrix that cluster 0 has been assigned to married (exactly as in the previous run), but this info is ignored in the final calculation of the classification error because it is not the majority class of the marital status attribute within each cluster. Also cluster 1 has been assigned to married and divorced. Both clusters 0 and 1 have children and a high income.

Again, these patterns give us ideas on how to tailor a marketing campaign to different slices of the market; this time we give special importance to the relation between marital status and the potential for spending money on children's clothes. Maybe the initial assumption was that married people with a medium income (cluster 3) might be inclined to spend more on clothes, but this pattern shows us that they tend to have fewer children, so in practical terms they are going to spend less money all in all. Therefore it might be good to set up selling campaigns and special offers for single parents with many children (just inventing ☺).

In conclusion, we can say that also when exploring data, it might be informative to map the clusters onto the classes that we have.

This mapping is not done to measure the goodness of our classification model, but to gain more insight into the possible relations and correlations existing in the data.

As Ian Witten said, it is a kind of black magic.

One important thing to remember: if you have strong patterns in the data, these patterns will surface in all kinds of manipulations.

With the junk dataset, on the other hand, we evaluate the clusters against the classes: there we do aim at evaluating the goodness of our unsupervised model against the correct classification:

With the default values, we get the following performance, which is not too bad for an unsupervised model.


The underpinnings: repetition

• exploration/description vs classification/prediction
• discover patterns that are not evident and that cannot be unveiled with traditional methods
• build a model to capture unseen or future cases
• describe and interpret data with the use of underlying mathematical models
• we do not set any limits to what we can do with the data we have, provided that it makes sense: we can basically use all the options and facilities that Weka allows us to use to gain insight into the data.