Transcript of: Partitioning-Based Clustering and Expectation Maximization (EM)
CIS 732: Machine Learning and Pattern Recognition, Lecture 21 of 42
Monday, 10 March 2008, William H. Hsu, Kansas State University

Page 1: Partitioning-Based Clustering and Expectation Maximization (EM)

Kansas State University
Department of Computing and Information Sciences
CIS 732: Machine Learning and Pattern Recognition

Monday, 10 March 2008
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.cis.ksu.edu/Courses/Spring-2008/CIS732

Readings: Sections 7.1 – 7.3, Han & Kamber 2e

Lecture 21 of 42

Page 2: EM Algorithm: Example [3]

• Expectation Step

– Suppose we observed m actual experiments, each n coin flips long

• Each experiment corresponds to one choice of coin (a coin chosen at random for that experiment)

• Let h denote the number of heads in experiment xi (a single data point)

– Q: How did we simulate the “fictional” data points, E[log P(x | θ)]?

– A: By estimating, for 1 ≤ i ≤ m (i.e., the real data points), the posterior probability shown below

• Maximization Step

– Q: What are we updating? What objective function are we maximizing?

– A: We are updating the estimates of α, p, and q to maximize the expected log-likelihood shown below; the resulting M-step updates follow

E-step estimate (posterior probability that experiment $x_i$ was generated by coin 1):

$$P(\mathrm{Coin}_1 \mid x_i) = \frac{P(x_i \mid \mathrm{Coin}_1)\,P(\mathrm{Coin}_1)}{P(x_i)} = \frac{\hat{\alpha}\,\hat{p}^{h}\,(1-\hat{p})^{n-h}}{\hat{\alpha}\,\hat{p}^{h}\,(1-\hat{p})^{n-h} + (1-\hat{\alpha})\,\hat{q}^{h}\,(1-\hat{q})^{n-h}}$$

M-step objective (expected log-likelihood over the m real data points):

$$E\!\left[\log P(x \mid \hat{\alpha},\hat{p},\hat{q})\right] = \sum_{i=1}^{m} E\!\left[\log P(x_i \mid \hat{\alpha},\hat{p},\hat{q})\right]$$

M-step updates:

$$\hat{\alpha} \leftarrow \frac{\sum_{i=1}^{m} P(\mathrm{Coin}_1 \mid x_i)}{m}, \qquad
\hat{p} \leftarrow \frac{\sum_{i=1}^{m} \frac{h_i}{n}\,P(\mathrm{Coin}_1 \mid x_i)}{\sum_{i=1}^{m} P(\mathrm{Coin}_1 \mid x_i)}, \qquad
\hat{q} \leftarrow \frac{\sum_{i=1}^{m} \frac{h_i}{n}\,\bigl(1 - P(\mathrm{Coin}_1 \mid x_i)\bigr)}{\sum_{i=1}^{m} \bigl(1 - P(\mathrm{Coin}_1 \mid x_i)\bigr)}$$
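To make the E-step and M-step concrete, here is a minimal NumPy sketch of this two-coin EM loop (an illustration, not part of the original slides); the names alpha, p, q, heads, and n mirror the symbols above, and the data and starting values are made up for demonstration.

```python
import numpy as np

def two_coin_em(heads, n, alpha=0.5, p=0.6, q=0.4, iters=50):
    """EM for a mixture of two biased coins.
    heads[i] = number of heads in experiment i (each experiment is n flips long)."""
    heads = np.asarray(heads, dtype=float)
    for _ in range(iters):
        # E-step: posterior responsibility that each experiment used coin 1
        like1 = alpha * p**heads * (1 - p)**(n - heads)
        like2 = (1 - alpha) * q**heads * (1 - q)**(n - heads)
        r1 = like1 / (like1 + like2)
        # M-step: re-estimate alpha, p, q from the expected (fictional) counts
        alpha = r1.mean()
        p = (r1 * heads / n).sum() / r1.sum()
        q = ((1 - r1) * heads / n).sum() / (1 - r1).sum()
    return alpha, p, q

# Hypothetical data: 6 experiments of 10 flips each
print(two_coin_em([9, 8, 2, 1, 9, 3], n=10))
```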

Page 3: EM for Unsupervised Learning

• Unsupervised Learning Problem

– Objective: estimate a probability distribution with unobserved variables

– Use EM to estimate mixture policy (more on this later; see 6.12, Mitchell)

• Pattern Recognition Examples

– Human-computer intelligent interaction (HCII)

• Detecting facial features in emotion recognition

• Gesture recognition in virtual environments

– Computational medicine [Frey, 1998]

• Determining morphology (shapes) of bacteria, viruses in microscopy

• Identifying cell structures (e.g., nucleus) and shapes in microscopy

– Other image processing

– Many other examples (audio, speech, signal processing; motor control; etc.)

• Inference Examples

– Plan recognition: mapping from (observed) actions to agent’s (hidden) plans

– Hidden changes in context: e.g., aviation; computer security; MUDs

Page 4: Unsupervised Learning: AutoClass [1]

• Bayesian Unsupervised Learning

– Given: D = {(x1, x2, …, xn)} (vectors of indistinguished attribute values)

– Return: set of class labels that has maximum a posteriori (MAP) probability

• Intuitive Idea

– Bayesian learning: select the maximum a posteriori (MAP) hypothesis (formula below)

– MDL/BIC (Occam’s Razor): priors P(h) express “cost of coding” each model h

– AutoClass

• Define mutually exclusive, exhaustive clusters (class labels) y1, y2, …, yJ

• Suppose: each yj (1 ≤ j ≤ J) contributes to x

• Suppose also: yj’s contribution ~ known pdf, e.g., Mixture of Gaussians (MoG)

• Conjugate priors: priors on y of same form as priors on x

• When to Use for Clustering

– Believe (or can assume): clusters generated by known pdf

– Believe (or can assume): clusters combined using finite mixture (later)

$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

Page 5: Unsupervised Learning: AutoClass [2]

• AutoClass Algorithm [Cheeseman et al, 1988]

– Based on maximizing P(x | θj, yj, J)

• θj: class (cluster) parameters (e.g., mean and variance)

• yj: synthetic classes (can estimate marginal P(yj) any time)

– Apply Bayes’s Theorem, use numerical BOC estimation techniques (cf. Gibbs)

– Search objectives

• Find best J (ideally: integrate out θj, yj; really: start with big J, decrease)

• Find θj, yj: use MAP estimation, then “integrate in the neighborhood” of yMAP

• EM: Find MAP Estimate for P(x | θj, yj, J) by Iterative Refinement

• Advantages over Symbolic (Non-Numerical) Methods

– Returns probability distribution over class membership

• More robust than “best” yj

• Compare: fuzzy set membership (similar but probabilistically motivated)

– Can deal with continuous as well as discrete data

Page 6: Unsupervised Learning: AutoClass [3]

• AutoClass Resources

– Beginning tutorial (AutoClass II): Cheeseman et al., Section 4.2.2 of Buchanan and Wilkins

– Project page: http://ic-www.arc.nasa.gov/ic/projects/bayes-group/autoclass/

• Applications

– Knowledge discovery in databases (KDD) and data mining

• Infrared astronomical satellite (IRAS): spectral atlas (sky survey)

• Molecular biology: pre-clustering DNA acceptor, donor sites (mouse, human)

• LandSat data from Kansas (30 km2 region, 1024 x 1024 pixels, 7 channels)

• Positive findings: see book chapter by Cheeseman and Stutz, online

– Other typical applications: see KD Nuggets (http://www.kdnuggets.com)

• Implementations

– Obtaining source code from project page

• AutoClass III: Lisp implementation [Cheeseman, Stutz, Taylor, 1992]

• AutoClass C: C implementation [Cheeseman, Stutz, Taylor, 1998]

– These and others at: http://www.recursive-partitioning.com/cluster.html

Page 7: Unsupervised Learning: Competitive Learning for Feature Discovery

• Intuitive Idea: Competitive Mechanisms for Unsupervised Learning

– Global organization from local, competitive weight update

• Basic principle expressed by Von der Malsburg

• Guiding examples from (neuro)biology: lateral inhibition

– Previous work: Hebb, 1949; Rosenblatt, 1959; Von der Malsburg, 1973; Fukushima, 1975; Grossberg, 1976; Kohonen, 1982

• A Procedural Framework for Unsupervised Connectionist Learning

– Start with identical (“neural”) processing units, with random initial parameters

– Set limit on “activation strength” of each unit

– Allow units to compete for right to respond to a set of inputs

• Feature Discovery

– Identifying (or constructing) new features relevant to supervised learning

– Examples: finding distinguishable letter characteristics in handwritten character recognition (HCR), optical character recognition (OCR)

– Competitive learning: transform X into X’; train units in X’ closest to x

Page 8: Unsupervised Learning: Kohonen’s Self-Organizing Map (SOM) [1]

• Another Clustering Algorithm

– aka Self-Organizing Feature Map (SOFM)

– Given: vectors of attribute values (x1, x2, …, xn)

– Returns: vectors of attribute values (x1’, x2’, …, xk’)

• Typically, n >> k (n is high, k = 1, 2, or 3; hence “dimensionality reducing”)

• Output: vectors x’, the projections of input points x; also get P(xj’ | xi)

• Mapping from x to x’ is topology preserving

• Topology Preserving Networks

– Intuitive idea: similar input vectors will map to similar clusters

– Recall: informal definition of cluster (isolated set of mutually similar entities)

– Restatement: “clusters of X (high-D) will still be clusters of X’ (low-D)”

• Representation of Node Clusters

– Group of neighboring artificial neural network units (neighborhood of nodes)

– SOMs: combine ideas of topology-preserving networks, unsupervised learning

• Implementation: http://www.cis.hut.fi/nnrc/ and MATLAB NN Toolkit

Page 9: Unsupervised Learning: Kohonen’s Self-Organizing Map (SOM) [2]

• Kohonen Network (SOM) for Clustering

– Training algorithm: unnormalized competitive learning

– Map is organized as a grid (shown here in 2D)

• Each node (grid element) has a weight vector wj

• Dimension of wj is n (same as input vector)

• Number of trainable parameters (weights): m · m · n for an m-by-m SOM

• 1999 state-of-the-art: typical small SOMs 5-20, “industrial strength” > 20

– Output found by selecting j* whose wj has minimum Euclidean distance from x

• Only one active node, aka Winner-Take-All (WTA): winning node j*

• i.e., j* = arg minj || wj - x ||2

• Update Rule

• Same as competitive learning algorithm, with one modification

• Neighborhood function associated with j* spreads the wj around

$$w_j(t+1) = \begin{cases} w_j(t) + r(t)\,h_{j,j^*}\,\bigl(x - w_j(t)\bigr), & \text{if } j \in \mathrm{Neighborhood}(j^*) \\ w_j(t), & \text{otherwise} \end{cases}$$

[Figure: SOM projection of x, a vector in n-space, to x’, a vector in 2-space]
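To illustrate the update rule above (a sketch, not from the slides), the following NumPy code trains an m-by-m SOM with winner-take-all selection, a Gaussian neighborhood h_{j,j*}, and annealed learning rate and neighborhood radius; the grid size, decay schedules, and data are illustrative assumptions.

```python
import numpy as np

def train_som(X, m=10, iters=1000, r0=0.5, sigma0=3.0, seed=0):
    """Train an m-by-m SOM on data X (shape: samples x n)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.random((m, m, n))                      # weight vector w_j for every grid node
    # grid coordinates of each node, used to measure neighborhood distance on the map
    coords = np.stack(np.meshgrid(np.arange(m), np.arange(m), indexing="ij"), axis=-1)
    for t in range(iters):
        r = r0 * (1 - t / iters)                   # learning rate r(t), annealed toward 0
        sigma = sigma0 * (1 - t / iters) + 1e-3    # shrinking neighborhood radius
        x = X[rng.integers(len(X))]
        # winner-take-all: node j* whose w_j has minimum Euclidean distance from x
        d = np.linalg.norm(W - x, axis=-1)
        jstar = np.unravel_index(np.argmin(d), (m, m))
        # Gaussian neighborhood h_{j,j*} decays with grid distance from j*
        grid_d2 = ((coords - np.array(jstar)) ** 2).sum(axis=-1)
        h = np.exp(-grid_d2 / (2 * sigma**2))
        W += r * h[..., None] * (x - W)            # w_j <- w_j + r(t) h_{j,j*} (x - w_j)
    return W

# usage: project 5-D points onto a 2-D map
W = train_som(np.random.rand(200, 5))
```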

Page 10: Unsupervised Learning: Kohonen’s Self-Organizing Map (SOM) [3]

[Figure: SOM grid with the winning node j* at the center of its neighborhood (radius 1 shown)]

• Traditional Competitive Learning

• Only train j*

• Corresponds to neighborhood of 0

• Neighborhood Function hj, j*

– For 2D Kohonen SOMs, h is typically a square or hexagonal region

• j*, the winner, is at the center of Neighborhood (j*)

• hj*, j* = 1

– Nodes in Neighborhood (j) updated whenever j wins, i.e., j* = j

– Strength of information fed back to wj is inversely proportional to its distance from j* for each x
– Often use an exponential or Gaussian (normal) distribution on the neighborhood to decay the weight delta as distance from j* increases

• Annealing of Training Parameters

– Neighborhood must shrink to 0 to achieve convergence

– r (learning rate) must also decrease monotonically


Page 11: Unsupervised Learning: SOM and Other Projections for Clustering

[Figure: Cluster Formation and Segmentation Algorithm (sketch): Dimensionality-Reducing Projection (x’) → Clusters of Similar Records → Delaunay Triangulation → Voronoi (Nearest Neighbor) Diagram (y)]

Page 12: Unsupervised Learning: Other Algorithms (PCA, Factor Analysis)

• Intuitive Idea

– Q: Why are dimensionality-reducing transforms good for supervised learning?

– A: There may be many attributes with undesirable properties, e.g.,

• Irrelevance: xi has little discriminatory power over c(x) = yi

• Sparseness of information: “feature of interest” spread out over many xi’s (e.g., text document categorization, where xi is a word position)

• We want to increase the “information density” by “squeezing X down”

• Principal Components Analysis (PCA)

– Combining redundant variables into a single variable (aka component, or factor)

– Example: ratings (e.g., Nielsen) and polls (e.g., Gallup); responses to certain questions may be correlated (e.g., “like fishing?” “time spent boating”)

• Factor Analysis (FA)

– General term for a class of algorithms that includes PCA

– Tutorial: http://www.statsoft.com/textbook/stfacan.html
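As a quick illustration of the PCA idea of “squeezing X down” (not part of the slides), this sketch projects data onto its top principal components via the SVD of the centered data matrix; the two-component choice and the random data are arbitrary.

```python
import numpy as np

def pca_project(X, k=2):
    """Project rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)              # center each attribute
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                  # directions of maximal variance
    return Xc @ components.T             # reduced k-dimensional representation

# usage: squeeze 10 correlated attributes down to 2 components
X = np.random.rand(100, 10)
Z = pca_project(X, k=2)
```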

Page 13: Clustering Methods: Design Choices

• Intuition

– Functional (declarative) definition: easy (“We recognize a cluster when we see it”)

– Operational (procedural, constructive) definition: much harder to give

– Possible reason: clustering of objects into groups has taxonomic semantics (e.g., shape, size, time, resolution, etc.)

• Possible Assumptions

– Data generated by a particular probabilistic model

– No statistical assumptions

• Design Choices

– Distance (similarity) measure: standard metrics, transformation-invariant metrics

• L1 (Manhattan): $\sum_i |x_i - y_i|$;  L2 (Euclidean): $\sqrt{\sum_i (x_i - y_i)^2}$;  L∞ (Sup): $\max_i |x_i - y_i|$

• Symmetry: Mahalanobis distance

• Shift, scale invariance: covariance matrix

– Transformations (e.g., covariance diagonalization: rotate axes to get rotational invariance, cf. PCA, FA)
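To show how the covariance matrix enters a scale-invariant distance (an illustrative sketch, not from the slides), here is a Mahalanobis distance alongside the L1, L2, and L∞ metrics just listed; the example data are arbitrary.

```python
import numpy as np

def l1(x, y):   return np.abs(x - y).sum()            # Manhattan
def l2(x, y):   return np.sqrt(((x - y) ** 2).sum())  # Euclidean
def linf(x, y): return np.abs(x - y).max()            # Sup norm

def mahalanobis(x, y, X):
    """Distance between x and y under the covariance of data set X.
    Whitening by the covariance matrix gives shift and scale invariance."""
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = x - y
    return float(np.sqrt(d @ cov_inv @ d))

X = np.random.rand(50, 3) * [1, 10, 100]   # attributes on very different scales
x, y = X[0], X[1]
print(l2(x, y), mahalanobis(x, y, X))      # Mahalanobis discounts the large-scale axes
```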

Page 14: Clustering: Applications

[Figure collage: clustering applications]

• Information Retrieval: Text Document Categorization – ThemeScapes (http://www.cartia.com), 6500 news stories from the WWW in 1997

• Transactional Database Mining – NCSA D2K 1.0 and 2.0 (http://www.ncsa.uiuc.edu/STI/ALG/)

• Facial Feature Extraction – http://www.cnl.salk.edu/~wiskott/Bibliographies/FaceFeatureFinding.html; data from T. Mitchell’s web site: http://www.cs.cmu.edu/~tom/faces.html

Page 15: Unsupervised Learning and Constructive Induction

• Unsupervised Learning in Support of Supervised Learning
– Given: D, labeled vectors (x, y)

– Return: D’ transformed training examples (x’, y’)

– Solution approach: constructive induction

• Feature “construction”: generic term

• Cluster definition

• Feature Construction: Front End
– Synthesizing new attributes

• Logical: x1 ∧ x2; arithmetic: x1 + x5 / x2

• Other synthetic attributes: f(x1, x2, …, xn), etc.

– Dimensionality-reducing projection, feature extraction

– Subset selection: finding relevant attributes for a given target y

– Partitioning: finding relevant attributes for given targets y1, y2, …, yp

• Cluster Definition: Back End
– Form, segment, and label clusters to get intermediate targets y’

– Change of representation: find an (x’, y’) that is good for learning target y

[Figure: constructive induction pipeline: (x, y) → Feature (Attribute) Construction and Partitioning → x’ / (x1’, …, xp’) → Cluster Definition → (x’, y’) or ((x1’, y1’), …, (xp’, yp’))]

Page 16: Clustering: Relation to Constructive Induction

• Clustering versus Cluster Definition

– Clustering: 3-step process

– Cluster definition: “back end” for feature construction

• Clustering: 3-Step Process

– Form

• (x1’, …, xk’) in terms of (x1, …, xn)

• NB: typically part of construction step, sometimes integrates both

– Segment

• (y1’, …, yJ’) in terms of (x1’, …, xk’)

• NB: number of clusters J not necessarily same as number of dimensions k

– Label

• Assign names (discrete/symbolic labels (v1’, …, vJ’)) to (y1’, …, yJ’)

• Important in document categorization (e.g., clustering text for info retrieval)

• Hierarchical Clustering: Applying Clustering Recursively

Page 17: CLUTO

• Clustering Algorithms
– High-performance and high-quality partitional clustering
– High-quality agglomerative clustering
– High-quality graph-partitioning-based clustering
– Hybrid partitional and agglomerative algorithms for building trees for very large datasets

• Cluster Analysis Tools
– Cluster signature identification
– Cluster organization identification

• Visualization Tools
– Hierarchical trees
– High-dimensional datasets
– Cluster relations

• Interfaces
– Stand-alone programs
– Library with a fully published API

• Available on Windows, Sun, and Linux

http://www.cs.umn.edu/~cluto

Page 18: Today

• Clustering
– Distance Measures
– Graph-Based Techniques
– K-Means Clustering
– Tools and Software for Clustering

Page 19: Prediction, Clustering, Classification

• What is Prediction?
– The goal of prediction is to forecast or deduce the value of an attribute based on values of other attributes
– A model is first created based on the data distribution
– The model is then used to predict future or unknown values

• Supervised vs. Unsupervised Classification
– Supervised Classification = Classification

• We know the class labels and the number of classes

– Unsupervised Classification = Clustering

• We do not know the class labels and may not know the number of classes

Page 20: What is Clustering in Data Mining?

• Cluster:
– a collection of data objects that are “similar” to one another and thus can be treated collectively as one group
– but, as a collection, they are sufficiently different from other groups

• Clustering:
– unsupervised classification
– no predefined classes

Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters.

Helps users understand the natural grouping or structure in a data set

Page 21: Requirements of Clustering Methods

• Scalability
• Dealing with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• The curse of dimensionality
• Interpretability and usability

Page 22: Applications of Clustering

• Clustering has wide applications in Pattern Recognition

• Spatial Data Analysis:
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining

• Image Processing

• Market Research

• Information Retrieval
– Document or term categorization
– Information visualization and IR interfaces

• Web Mining
– Cluster Web usage data to discover groups of similar access patterns
– Web Personalization

Page 23: Clustering Methodologies

• Two general methodologies
– Partitioning-Based Algorithms
– Hierarchical Algorithms

• Partitioning-Based
– divide a set of N items into K clusters (top-down)

• Hierarchical
– agglomerative: pairs of items or clusters are successively linked to produce larger clusters
– divisive: start with the whole set as a cluster and successively divide sets into smaller partitions

Page 24: Distance or Similarity Measures

• Measuring Distance
– In order to group similar items, we need a way to measure the distance between objects (e.g., records)
– Note: distance = inverse of similarity
– Often based on the representation of objects as “feature vectors”

An Employee DB:

ID  Gender  Age  Salary
1   F       27    19,000
2   M       51    64,000
3   M       52   100,000
4   F       33    55,000
5   M       45    45,000

Term Frequencies for Documents:

      T1  T2  T3  T4  T5  T6
Doc1   0   4   0   0   0   2
Doc2   3   1   4   3   1   2
Doc3   3   0   0   0   3   0
Doc4   0   1   0   3   0   0
Doc5   2   2   2   3   1   4

Which objects are more similar?

Page 25: Distance or Similarity Measures

• Properties of Distance Measures:
– for all objects A and B, dist(A, B) ≥ 0, and dist(A, B) = dist(B, A)
– for any object A, dist(A, A) = 0
– dist(A, C) ≤ dist(A, B) + dist(B, C)

• Common Distance Measures (for $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$):

– Manhattan distance: $dist(X, Y) = |x_1 - y_1| + |x_2 - y_2| + \cdots + |x_n - y_n|$

– Euclidean distance: $dist(X, Y) = \sqrt{(x_1 - y_1)^2 + \cdots + (x_n - y_n)^2}$

Both can be normalized to make values fall between 0 and 1.

– Cosine similarity: $sim(X, Y) = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$, with distance defined as $dist(X, Y) = 1 - sim(X, Y)$
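A small Python rendering of these three measures (an illustrative sketch, not from the slides); dist_cosine follows the dist = 1 - sim convention above, and the sample vectors are Doc1 and Doc2 from the term-frequency table on the previous slide.

```python
import math

def dist_manhattan(X, Y):
    return sum(abs(x - y) for x, y in zip(X, Y))

def dist_euclidean(X, Y):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(X, Y)))

def sim_cosine(X, Y):
    num = sum(x * y for x, y in zip(X, Y))
    den = math.sqrt(sum(x * x for x in X)) * math.sqrt(sum(y * y for y in Y))
    return num / den

def dist_cosine(X, Y):
    # distance as 1 - similarity, so identical directions give distance 0
    return 1 - sim_cosine(X, Y)

doc1 = [0, 4, 0, 0, 0, 2]
doc2 = [3, 1, 4, 3, 1, 2]
print(dist_manhattan(doc1, doc2), dist_euclidean(doc1, doc2), dist_cosine(doc1, doc2))
```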

Page 26: Distance or Similarity Measures

• Weighting Attributes

– in some cases we want some attributes to count more than others
– associate a weight with each of the attributes in calculating distance, e.g., the weighted distance below

• Nominal (Categorical) Attributes
– can use simple matching: distance = 0 if values match, 1 otherwise
– or convert each nominal attribute to a set of binary attributes, then use the usual distance measure
– if all attributes are nominal, we can normalize by dividing the number of matches by the total number of attributes

• Normalization:
– want values to fall between 0 and 1 (see the min-max formula below)
– other variations possible

Weighted Euclidean distance:

$$dist(X, Y) = \sqrt{w_1\,(x_1 - y_1)^2 + \cdots + w_n\,(x_n - y_n)^2}$$

Min-max normalization:

$$x_i' = \frac{x_i - \min_i}{\max_i - \min_i}$$

Page 27: Distance or Similarity Measures

• Example

– max distance for salary: 100,000 - 19,000 = 81,000
– max distance for age: 52 - 27 = 25

– dist(ID2, ID3) = SQRT(0 + (0.04)² + (0.44)²) = 0.44
– dist(ID2, ID4) = SQRT(1 + (0.72)² + (0.12)²) = 1.24

Original data (An Employee DB):

ID  Gender  Age  Salary
1   F       27    19,000
2   M       51    64,000
3   M       52   100,000
4   F       33    55,000
5   M       45    45,000

Applying min-max normalization, $x_i' = \frac{x_i - \min_i}{\max_i - \min_i}$, gives:

ID  Gender  Age   Salary
1   1       0.00  0.00
2   0       0.96  0.56
3   0       1.00  1.00
4   1       0.24  0.44
5   0       0.72  0.32
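A short sketch (not from the slides) that reproduces this example: Gender is mapped to 1/0, Age and Salary are min-max normalized and rounded to two decimals as in the table above, and the two quoted distances are recomputed.

```python
import math

# An Employee DB (ID: gender, age, salary), from the table above
employees = {1: ("F", 27, 19000), 2: ("M", 51, 64000), 3: ("M", 52, 100000),
             4: ("F", 33, 55000), 5: ("M", 45, 45000)}

ages = [a for _, a, _ in employees.values()]
salaries = [s for _, _, s in employees.values()]

def normalize(gender, age, salary):
    """Gender by simple matching (F -> 1, M -> 0); Age and Salary min-max scaled
    to [0, 1] and rounded to two decimals, as in the normalized table above."""
    return (1.0 if gender == "F" else 0.0,
            round((age - min(ages)) / (max(ages) - min(ages)), 2),
            round((salary - min(salaries)) / (max(salaries) - min(salaries)), 2))

def dist(i, j):
    xi, xj = normalize(*employees[i]), normalize(*employees[j])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(round(dist(2, 3), 2))   # 0.44, as in the example
print(round(dist(2, 4), 2))   # 1.24, as in the example
```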

Page 28: Domain-Specific Distance Functions

• For some data sets, we may need to use specialized functions

– we may want a single or a selected group of attributes to be used in the computation of distance - same problem as “feature selection”

– may want to use special properties of one or more attribute in the data

– natural distance functions may exist in the data

Example: Zip Codes
distzip(A, B) = 0, if zip codes are identical
distzip(A, B) = 0.1, if first 3 digits are identical
distzip(A, B) = 0.5, if first digits are identical
distzip(A, B) = 1, if first digits are different

Example: Customer Solicitation
distsolicit(A, B) = 0, if both A and B responded
distsolicit(A, B) = 0.1, if both A and B were chosen but did not respond
distsolicit(A, B) = 0.5, if both A and B were chosen, but only one responded
distsolicit(A, B) = 1, if one was chosen, but the other was not
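A direct transcription of the zip-code example into Python (a sketch; the function name and the string representation of zip codes are assumptions, not from the slides):

```python
def dist_zip(a: str, b: str) -> float:
    """Domain-specific distance between two 5-digit zip codes (as strings)."""
    if a == b:
        return 0.0        # zip codes are identical
    if a[:3] == b[:3]:
        return 0.1        # first 3 digits are identical
    if a[0] == b[0]:
        return 0.5        # first digits are identical
    return 1.0            # first digits are different

print(dist_zip("66506", "66502"))   # 0.1
```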

Page 29: Distance (Similarity) Matrix

• Similarity (Distance) Matrix
– based on the distance or similarity measure, we can construct a symmetric matrix of distance (or similarity) values
– the (i, j) entry in the matrix is the distance (similarity) between items i and j: dij = similarity (or distance) of Di to Dj

      I1    I2    …    In
I1          d12   …    d1n
I2    d21         …    d2n
…
In    dn1   dn2   …

Note that dij = dji (i.e., the matrix is symmetric), so we only need the lower triangle part of the matrix. The diagonal is all 1’s (similarity) or all 0’s (distance).

Page 30: Example: Term Similarities in Documents

Document-term matrix (term frequencies):

      T1  T2  T3  T4  T5  T6  T7  T8
Doc1   0   4   0   0   0   2   1   3
Doc2   3   1   4   3   1   2   0   1
Doc3   3   0   0   0   3   0   3   0
Doc4   0   1   0   3   0   0   2   0
Doc5   2   2   2   3   1   4   0   2

Term similarity as the inner product of term weights over the N documents:

$$sim(T_i, T_j) = \sum_{k=1}^{N} (w_{ik} \times w_{jk})$$

Term-Term Similarity Matrix:

      T1  T2  T3  T4  T5  T6  T7
T2     7
T3    16   8
T4    15  12  18
T5    14   3   6   6
T6    14  18  16  18   6
T7     9   6   0   6   9   2
T8     7  17   8   9   3  16   3
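The term-term similarity matrix above can be reproduced with a few lines of NumPy (an illustrative sketch, not from the slides); the thresholding step anticipates the next two slides.

```python
import numpy as np

# rows = Doc1..Doc5, columns = T1..T8 (term frequencies w_ik from the table above)
W = np.array([[0, 4, 0, 0, 0, 2, 1, 3],
              [3, 1, 4, 3, 1, 2, 0, 1],
              [3, 0, 0, 0, 3, 0, 3, 0],
              [0, 1, 0, 3, 0, 0, 2, 0],
              [2, 2, 2, 3, 1, 4, 0, 2]])

# sim(Ti, Tj) = sum_k w_ik * w_jk  ==  column-by-column inner products
S = W.T @ W
print(S[0, 1])   # sim(T1, T2) = 7
print(S[0, 2])   # sim(T1, T3) = 16

# thresholding at 10 gives the 0/1 matrix used to build the similarity graph
A = (S >= 10).astype(int)
np.fill_diagonal(A, 0)
```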

Page 31: Similarity (Distance) Thresholds

– A similarity (distance) threshold may be used to mark pairs that are “sufficiently” similar

Term-Term Similarity Matrix (from the previous example):

      T1  T2  T3  T4  T5  T6  T7
T2     7
T3    16   8
T4    15  12  18
T5    14   3   6   6
T6    14  18  16  18   6
T7     9   6   0   6   9   2
T8     7  17   8   9   3  16   3

Using a threshold value of 10 marks the “sufficiently similar” pairs:

      T1  T2  T3  T4  T5  T6  T7
T2     0
T3     1   0
T4     1   1   1
T5     1   0   0   0
T6     1   1   1   1   0
T7     0   0   0   0   0   0
T8     0   1   0   0   0   1   0

Page 32: Graph Representation

• The similarity matrix can be visualized as an undirected graph
– each item is represented by a node, and edges represent the fact that two items are similar (a one in the similarity threshold matrix)

      T1  T2  T3  T4  T5  T6  T7
T2     0
T3     1   0
T4     1   1   1
T5     1   0   0   0
T6     1   1   1   1   0
T7     0   0   0   0   0   0
T8     0   1   0   0   0   1   0

[Figure: undirected similarity graph on nodes T1–T8 corresponding to the 1 entries above]

If no threshold is used, then the matrix can be represented as a weighted graph.

Page 33: Simple Clustering Algorithms

• If we are interested only in threshold (and not the degree of similarity or distance), we can use the graph directly for clustering

• Clique Method (complete link)
– all items within a cluster must be within the similarity threshold of all other items in that cluster
– clusters may overlap
– generally produces small but very tight clusters

• Single Link Method
– any item in a cluster must be within the similarity threshold of at least one other item in that cluster
– produces larger but weaker clusters

• Other methods
– star method: start with an item and place all related items in that cluster
– string method: start with an item; place one related item in that cluster; then place another item related to the last item entered, and so on
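As an illustration of using the threshold graph directly (a sketch, not from the slides, assuming the networkx package is available), the clique method and the single link method correspond to standard graph routines; the edges below are the 1 entries of the thresholded term-similarity matrix from the running example.

```python
import networkx as nx

# edges of the threshold graph (similarity >= 10) from the term-similarity example
edges = [("T1", "T3"), ("T1", "T4"), ("T1", "T5"), ("T1", "T6"),
         ("T2", "T4"), ("T2", "T6"), ("T2", "T8"),
         ("T3", "T4"), ("T3", "T6"), ("T4", "T6"), ("T6", "T8")]
G = nx.Graph(edges)
G.add_node("T7")                      # T7 is not similar to anything above the threshold

# Clique method (complete link): each maximal clique becomes a (possibly overlapping) cluster
clique_clusters = [set(c) for c in nx.find_cliques(G)]

# Single link method: connected components give disjoint, weaker clusters
single_link_clusters = [set(c) for c in nx.connected_components(G)]

print(clique_clusters)        # {T1,T3,T4,T6}, {T2,T4,T6}, {T2,T6,T8}, {T1,T5}, {T7}
print(single_link_clusters)   # {T1,T2,T3,T4,T5,T6,T8} and {T7}
```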

Page 34: Simple Clustering Algorithms

• Clique Method
– a clique is a completely connected subgraph of a graph
– in the clique method, each maximal clique in the graph becomes a cluster

[Figure: threshold similarity graph on T1–T8, as on the previous slide]

Maximal cliques (and therefore the clusters) in the previous example are:

{T1, T3, T4, T6}, {T2, T4, T6}, {T2, T6, T8}, {T1, T5}, {T7}

Note that, for example, {T1, T3, T4} is also a clique, but is not maximal.

Page 35: Simple Clustering Algorithms

• Single Link Method

1. select an item not in a cluster and place it in a new cluster
2. place all other similar items in that cluster
3. repeat step 2 for each item in the cluster until nothing more can be added
4. repeat steps 1–3 for each item that remains unclustered

[Figure: threshold similarity graph on T1–T8, as on the previous slides]

In this case the single link method produces only two clusters:

{T1, T3, T4, T5, T6, T2, T8} {T7}

Note that the single link method does not allow overlapping clusters, thus partitioning the set of items.

Page 36: Clustering with Existing Clusters

• The notion of comparing item similarities can be extended to clusters themselves, by focusing on a representative vector for each cluster
– cluster representatives can be actual items in the cluster or other “virtual” representatives such as the centroid
– this methodology reduces the number of similarity computations in clustering
– clusters are revised successively until a stopping condition is satisfied, or until no more changes to clusters can be made

• Partitioning Methods
– reallocation method: start with an initial assignment of items to clusters and then move items from cluster to cluster to obtain an improved partitioning
– single pass method: simple and efficient, but produces large clusters, and depends on the order in which items are processed

• Hierarchical Agglomerative Methods
– starts with individual items and combines them into clusters
– then successively combines smaller clusters to form larger ones
– grouping of individual items can be based on any of the methods discussed earlier
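For the hierarchical agglomerative case, a minimal sketch using SciPy (an assumption; SciPy is not mentioned in the slides); method="single" corresponds to the single link grouping discussed earlier, and the random data are purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(20, 3)                         # 20 items described by 3 attributes
D = pdist(X, metric="euclidean")                  # condensed pairwise distance matrix
Z = linkage(D, method="single")                   # successively merge closest clusters (single link)
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(labels)
```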

Page 37: K-Means Algorithm

• The basic algorithm (based on reallocation method):

1. select K data points as the initial representatives

2. for i = 1 to N, assign item xi to the most similar centroid (this gives K clusters)

3. for j = 1 to K, recalculate the cluster centroid Cj

4. repeat steps 2 and 3 until there is (little or) no change in clusters

• Example: Clustering Terms

      T1  T2  T3  T4  T5  T6  T7  T8   C1   C2   C3
Doc1   0   4   0   0   0   2   1   3  4/2  0/2  2/2
Doc2   3   1   4   3   1   2   0   1  4/2  7/2  3/2
Doc3   3   0   0   0   3   0   3   0  3/2  0/2  3/2
Doc4   0   1   0   3   0   0   2   0  1/2  3/2  0/2
Doc5   2   2   2   3   1   4   0   2  4/2  5/2  5/2

Initial (arbitrary) assignment: C1 = {T1, T2}, C2 = {T3, T4}, C3 = {T5, T6}; the cluster centroids appear in the last three columns.
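A compact NumPy sketch of the basic reallocation algorithm in steps 1–4 above (illustrative only, not the course's code); it assigns items by Euclidean distance to the nearest centroid, whereas the worked example on this and the next slide scores terms by a simple dot-product similarity with the centroids.

```python
import numpy as np

def k_means(X, k, iters=100, seed=0):
    """Basic K-means: X has one row per item; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: K data points as initial representatives
    for _ in range(iters):
        # step 2: assign each item to the most similar (here: nearest) centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # step 3: recalculate each cluster centroid
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # step 4: stop when there is no change in the clusters
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# usage: cluster the 8 terms (columns of the document-term matrix) into 3 clusters
W = np.array([[0, 4, 0, 0, 0, 2, 1, 3],
              [3, 1, 4, 3, 1, 2, 0, 1],
              [3, 0, 0, 0, 3, 0, 3, 0],
              [0, 1, 0, 3, 0, 0, 2, 0],
              [2, 2, 2, 3, 1, 4, 0, 2]])
labels, _ = k_means(W.T, k=3)
print(labels)
```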

Page 38: Example: K-Means

• Example (continued)

Now, using the simple similarity measure, compute the new cluster-term similarity matrix:

            T1    T2    T3    T4    T5    T6    T7    T8
Class1    29/2  29/2  24/2  27/2  17/2  32/2  15/2  24/2
Class2    31/2  20/2  38/2  45/2  12/2  34/2   6/2  17/2
Class3    28/2  21/2  22/2  24/2  17/2  30/2  11/2  19/2
Assign to Class2 Class1 Class2 Class2 Class3 Class2 Class1 Class1

Now compute new cluster centroids using the original document-term matrix:

      T1  T2  T3  T4  T5  T6  T7  T8   C1    C2   C3
Doc1   0   4   0   0   0   2   1   3  8/3   2/4  0/1
Doc2   3   1   4   3   1   2   0   1  2/3  12/4  1/1
Doc3   3   0   0   0   3   0   3   0  3/3   3/4  3/1
Doc4   0   1   0   3   0   0   2   0  3/3   3/4  0/1
Doc5   2   2   2   3   1   4   0   2  4/3  11/4  1/1

The process is repeated until no further changes are made to the clusters.

Page 39: K-Means Algorithm

• Strengths of k-means:
– Relatively efficient: O(tkn), where n is # of objects, k is # of clusters, and t is # of iterations; normally, k, t << n
– Often terminates at a local optimum

• Weaknesses of k-means:
– Applicable only when the mean is defined; what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers

• Variations of k-means usually differ in:
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means

Page 40: Terminology

• Expectation-Maximization (EM) Algorithm

– Iterative refinement: repeat until convergence to a locally optimal label

– Expectation step: estimate parameters with which to simulate data

– Maximization step: use simulated (“fictitious”) data to update parameters

• Unsupervised Learning and Clustering

– Constructive induction: using unsupervised learning for supervised learning

• Feature construction: “front end” - construct new x values

• Cluster definition: “back end” - use these to reformulate y

– Clustering problems: formation, segmentation, labeling

– Key criterion: distance metric (points closer intra-cluster than inter-cluster)

– Algorithms

• AutoClass: Bayesian clustering

• Principal Components Analysis (PCA), factor analysis (FA)

• Self-Organizing Maps (SOM): topology-preserving transform (dimensionality reduction) for competitive unsupervised learning

Page 41: Summary Points

• Expectation-Maximization (EM) Algorithm

• Unsupervised Learning and Clustering

– Types of unsupervised learning

• Clustering, vector quantization

• Feature extraction (typically, dimensionality reduction)

– Constructive induction: unsupervised learning in support of supervised learning

• Feature construction (aka feature extraction)

• Cluster definition

– Algorithms

• EM: mixture parameter estimation (e.g., for AutoClass)

• AutoClass: Bayesian clustering

• Principal Components Analysis (PCA), factor analysis (FA)

• Self-Organizing Maps (SOM): projection of data; competitive algorithm

– Clustering problems: formation, segmentation, labeling

• Next Lecture: Time Series Learning and Characterization