Prof. Enza Messina Lesson 1


Prof. Enza Messina

Lesson 1


Classification

Clustering


Objects characterized by one or more features

Classification

◦ Have labels for some points

◦ Want a “rule” that will accurately assign labels to new points

◦ Supervised learning

Clustering

◦ No labels

◦ Group points into clusters based on how “near” they are to one another

◦ Identify structure in data

◦ Unsupervised learning

[Figure: scatter plot of the objects, Feature X vs. Feature Y]


Classification

In classification, data vectors $x \in \mathbb{R}^d$, where d is the input space dimensionality (number of features), are mapped to a set of class labels, represented as $y \in \{1, \dots, C\}$, where C is the total number of classes. This mapping is modeled in terms of a mathematical function $y = y(x, w)$, where w is a vector of adjustable parameters. These parameters are determined (optimized) by a learning algorithm, based on a dataset of input-output examples $(x_i, y_i)$, $i = 1, \dots, N$.

In clustering, labeled data is unavailable!


• Clustering is a subjective process

• “In cluster analysis a group of objects is split up into a number of more or less homogeneous subgroups on the basis of an often subjectively chosen measure of similarity (i.e., chosen subjectively based on its ability to create “interesting” clusters), such that the similarity between objects within a subgroup is larger than the similarity between objects belonging to different subgroups” (Backer and Jain, 1981)

• A different clustering criterion or algorithm, or even the same algorithm with a different choice of parameters, may produce completely different clustering results


• “A cluster is a set of entities which are alike, and entities from different clusters are not alike.”

• “A cluster is an aggregate of points in the test space such that the distance between any two points in the cluster is less than the distance between any point in the cluster and any point not in it.”

• “Clusters may be described as continuous regions of this space (d-dimensional feature space) containing a relatively high density of points, separated from other such regions by regions containing a relatively low density of points.”


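Given a set of input patterns $X = \{x_1, \dots, x_j, \dots, x_N\}$, where $x_j = (x_{j1}, x_{j2}, \dots, x_{jd}) \in \mathbb{R}^d$, with each measure $x_{ji}$ called a feature (also attribute, dimension or variable), hard partitional clustering attempts to seek a K-partition of X, $C = \{C_1, \dots, C_K\}$ ($K \le N$), such that:

• $C_i \neq \emptyset$, $i = 1, \dots, K$

• $\bigcup_{i=1}^{K} C_i = X$

• $C_i \cap C_j = \emptyset$, $i, j = 1, \dots, K$ and $i \neq j$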


An object may also be allowed to belong to all K clusters with a degree of membership $u_{i,j} \in [0, 1]$, which represents the membership coefficient of the jth object in the ith cluster and satisfies the following constraints: $\sum_{i=1}^{K} u_{i,j} = 1$ for all j, and $\sum_{j=1}^{N} u_{i,j} < N$ for all i. This is known as fuzzy clustering.
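The hard and fuzzy partition definitions can be checked mechanically. The following is a minimal sketch, assuming clusters are given as lists of object indices and fuzzy memberships as a K-by-N nested list; the function names and the numerical tolerance are illustrative choices, not part of the lesson.

```python
def is_hard_partition(clusters, n_objects):
    """Check the hard-partition conditions for clusters given as lists
    of object indices taken from range(n_objects)."""
    if any(len(c) == 0 for c in clusters):
        return False                      # every cluster must be non-empty
    assigned = [j for c in clusters for j in c]
    covers_all = set(assigned) == set(range(n_objects))   # union equals X
    disjoint = len(assigned) == len(set(assigned))        # no object in two clusters
    return covers_all and disjoint


def is_fuzzy_partition(u, tol=1e-9):
    """Check the fuzzy constraints; u[i][j] is the membership of object j in cluster i."""
    k, n = len(u), len(u[0])
    in_range = all(0.0 <= u[i][j] <= 1.0 for i in range(k) for j in range(n))
    # memberships of each object sum to one across the clusters
    sums_to_one = all(abs(sum(u[i][j] for i in range(k)) - 1.0) <= tol for j in range(n))
    # no single cluster fully absorbs every object
    not_everything = all(sum(u[i]) < n for i in range(k))
    return in_range and sums_to_one and not_everything
```

For example, `[[0, 1], [2]]` is a valid hard partition of three objects, while `[[0, 1], [1, 2]]` is not, because object 1 appears twice.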


E.g. Principal Component Analysis


Partition-based?

Hierarchical? Density-based?


Are the clusters meaningful?


A set of clusters is not itself a finished result but only a possible outline. Further experiments may be required


• Engineering: biometric and speech recognition, radar signal analysis, information compression and noise removal;

• Computer Science: web mining, spatial database analysis, information retrieval, image segmentation;

• Life and medical sciences: taxonomy definition, gene and protein function identification, disease diagnosis and treatment;

• Social sciences: behavior pattern analysis, analysis of social networks, study of criminal psychology;

• Economics: customer characterization, purchasing pattern recognition, stock trend analysis.


• A data object is described by a set of features or variables, usually represented as a multidimensional vector

• A feature can be classified as:

• Continuous

• Discrete

• Binary


• Another property of features is the measurement level, which reflects the relative significance of numbers

• A feature can be:

• Nominal (labels without a specified order)

• Ordinal (labels with a specified order)

• Interval (numerical values without a true zero, e.g. degrees Celsius)

• Ratio (numerical values with a true zero, e.g. kelvins)


• Missing data are quite common for real data sets due to all kinds of limitations and uncertainties

• How can we deal with them?

• The most immediate way to deal with missing data is to discard the records that contain missing features

• This approach can only work when the number of objects that have missing features is much smaller than the total number of objects in the data set.


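• Approach #1: calculate the proximity using only the feature values that are available:

$D(x_i, x_j) = \frac{d}{d - \sum_{l=1}^{d} \delta_{ijl}} \sum_{l:\, \delta_{ijl} = 0} d_l(x_{il}, x_{jl})$

where $d_l(x_{il}, x_{jl})$ is the distance between each pair of components of the two objects, and $\delta_{ijl} = 1$ if $x_{il}$ or $x_{jl}$ is missing and 0 otherwise.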
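Approach #1 can be sketched in a few lines. This is an illustrative sketch only: it assumes missing values are marked with `None`, uses the absolute difference as the per-component distance by default, and rescales the partial sum by d divided by the number of usable features; the function name is hypothetical.

```python
def partial_distance(x, y, component_distance=lambda a, b: abs(a - b)):
    """Distance computed only over features observed in both objects.

    Missing values are marked with None (an assumption of this sketch).
    """
    d = len(x)
    observed = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not observed:
        raise ValueError("no feature is observed in both objects")
    total = sum(component_distance(a, b) for a, b in observed)
    # rescale so that objects with many missing features are not favoured
    return d / len(observed) * total
```

With `x = [1.0, None, 3.0]` and `y = [2.0, 5.0, None]`, only the first feature is usable, so the single component distance 1.0 is scaled by 3/1.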


• Approach #2: calculate the average distance between all pairs of data objects along all the features and use that to estimate distances for the missing features.

The average distance for the lth feature is obtained as:

$\bar{d}_l = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} d_l(x_{il}, x_{jl})$

Therefore, the distance for the missing feature is obtained as:

$d_l(x_{il}, x_{jl}) = \bar{d}_l$
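A possible sketch of Approach #2 follows. Several choices here are assumptions rather than part of the lesson: missing values are marked with `None`, the per-feature distance is the absolute difference, and the average is taken over distinct pairs of objects in which the feature is observed for both (other normalizations are possible). The function names are hypothetical.

```python
from itertools import combinations


def mean_feature_distance(data, l):
    """Average |x_il - x_jl| over pairs of objects that both have feature l."""
    values = [row[l] for row in data if row[l] is not None]
    pairs = list(combinations(values, 2))
    if not pairs:
        raise ValueError("feature is observed in fewer than two objects")
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def distance_with_imputation(data, i, j):
    """Sum of per-feature distances between objects i and j, where the distance
    on a missing feature is replaced by that feature's average distance."""
    total = 0.0
    for l, (a, b) in enumerate(zip(data[i], data[j])):
        if a is None or b is None:
            total += mean_feature_distance(data, l)
        else:
            total += abs(a - b)
    return total
```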


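• A dissimilarity or distance function D on a data set X is defined to satisfy the following conditions:

• Symmetry: $D(x_i, x_j) = D(x_j, x_i)$

• Positivity: $D(x_i, x_j) \ge 0$ for all $x_i$ and $x_j$

• If the following also hold, then it is called a metric:

• Triangle inequality: $D(x_i, x_j) \le D(x_i, x_k) + D(x_k, x_j)$ for all $x_i$, $x_j$ and $x_k$

• Reflexivity: $D(x_i, x_j) = 0$ if and only if $x_i = x_j$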
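The four conditions can be verified numerically on a precomputed distance matrix. The sketch below assumes the matrix is a square nested list over distinct objects (so a zero distance is expected only on the diagonal); the function name and tolerance are illustrative.

```python
from itertools import product


def check_distance_properties(D, tol=1e-12):
    """Report which distance/metric conditions hold for a square matrix D."""
    idx = range(len(D))
    return {
        "symmetry": all(abs(D[i][j] - D[j][i]) <= tol
                        for i, j in product(idx, idx)),
        "positivity": all(D[i][j] >= 0 for i, j in product(idx, idx)),
        "triangle_inequality": all(D[i][j] <= D[i][k] + D[k][j] + tol
                                   for i, j, k in product(idx, idx, idx)),
        # assumes the objects are pairwise distinct
        "reflexivity": all((D[i][j] <= tol) == (i == j)
                           for i, j in product(idx, idx)),
    }
```

For the points 0, 1 and 3 on a line, the absolute difference passes all four checks, whereas the squared difference fails the triangle inequality (9 > 1 + 4), so it is a distance function but not a metric.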


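• Likewise, a similarity function S on a data set X is defined to satisfy the following conditions:

• Symmetry: $S(x_i, x_j) = S(x_j, x_i)$

• Positivity: $0 \le S(x_i, x_j) \le 1$ for all $x_i$ and $x_j$

• If the following also hold, then it is called a similarity metric:

• Triangle inequality: $S(x_i, x_j)\, S(x_j, x_k) \le [S(x_i, x_j) + S(x_j, x_k)]\, S(x_i, x_k)$ for all $x_i$, $x_j$ and $x_k$

• Reflexivity: $S(x_i, x_j) = 1$ if and only if $x_i = x_j$

The triangle inequality takes this form because D(i,j) = 1/S(i,j): substituting into $D(i,k) \le D(i,j) + D(j,k)$ and multiplying both sides by the three similarities gives the condition above.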


• Perhaps the most commonly known is the Euclidean distance, or L2 norm, represented as:

$D(x_i, x_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^2 \right)^{1/2}$

• This measure tends to form hyperspherical clusters

• If the features are measured with very different units, features with large values and variances will dominate the others


• How can we deal with features with different scales?

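• We standardize the data, so that each feature has zero mean and unit variance (z-normalization):

$x^*_{il} = \frac{x_{il} - m_l}{s_l}$

where $m_l$ and $s_l$ are the sample mean and sample standard deviation of the lth feature

• Another approach is based on the maximum and minimum of the data, so that all features lie in the range [0,1]:

$x^*_{il} = \frac{x_{il} - \min_i x_{il}}{\max_i x_{il} - \min_i x_{il}}$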
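The two rescaling schemes can be sketched as follows, assuming each feature is available as a plain list of numbers. The function names are illustrative, and the use of the sample standard deviation (N − 1 denominator, as computed by `statistics.stdev`) is an assumption of this sketch.

```python
from statistics import mean, stdev


def z_normalize(column):
    """Rescale one feature to zero mean and unit (sample) variance."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]


def min_max_scale(column):
    """Rescale one feature linearly so that its values lie in [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]
```

Both functions fail with a division by zero on a constant feature, which carries no information for clustering and would normally be dropped beforehand.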


• The Euclidean distance is a special case of a family of metrics, called the Minkowski distance, or Lp norm:

$D(x_i, x_j) = \left( \sum_{l=1}^{d} |x_{il} - x_{jl}|^p \right)^{1/p}$

• When p=2, this distance becomes the Euclidean distance

• When p=1, this distance is called the Manhattan distance, or L1 norm

• When p=∞, this is called the sup distance, or L∞ norm: $D(x_i, x_j) = \max_{1 \le l \le d} |x_{il} - x_{jl}|$
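A minimal sketch of the Lp family, assuming objects are equal-length sequences of numbers; the function name is illustrative, and the L∞ case is handled separately as the largest per-feature difference.

```python
import math


def minkowski(x, y, p):
    """Minkowski (Lp) distance between two equal-length numeric sequences."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)            # sup distance: largest single difference
    return sum(v ** p for v in diffs) ** (1.0 / p)
```

For the points (0, 0) and (3, 4), p = 1 gives 7, p = 2 gives 5 and p = ∞ gives 4, which shows how larger p puts more weight on the largest coordinate difference.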


• The Mahalanobis distance is defined as:

$D(x_i, x_j) = (x_i - x_j)^T S^{-1} (x_i - x_j)$

where S is the within-class covariance matrix defined as

$S = E[(x - \mu)(x - \mu)^T]$

where μ is the mean vector and E the expected value.

• This distance is effective when features are correlated

• When the features are uncorrelated and have unit variance (S is the identity matrix), this distance reduces to the squared Euclidean distance

• It is computationally expensive for large-scale datasets
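The computation can be sketched in pure Python as below. This is only an illustration: the covariance is estimated from the data with an N − 1 denominator (an assumption), the quadratic form is returned without a square root, and the linear system is solved by a small Gaussian elimination instead of an explicit matrix inverse. In practice a numerical library would normally be used; all function names here are hypothetical.

```python
def covariance_matrix(data):
    """Sample covariance matrix (N - 1 denominator) of a list of d-dimensional rows."""
    n, d = len(data), len(data[0])
    mu = [sum(row[l] for row in data) / n for l in range(d)]
    return [[sum((row[a] - mu[a]) * (row[b] - mu[b]) for row in data) / (n - 1)
             for b in range(d)] for a in range(d)]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):                       # back substitution
        z[r] = (M[r][n] - sum(M[r][c] * z[c] for c in range(r + 1, n))) / M[r][r]
    return z


def mahalanobis(x, y, S):
    """Quadratic form (x - y)^T S^{-1} (x - y)."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve(S, diff)                                 # z = S^{-1} (x - y)
    return sum(a * b for a, b in zip(diff, z))
```

Solving the system costs on the order of d³ operations per covariance matrix, which is one reason the measure becomes expensive as the number of features grows.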


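• The distance measure can also be derived from a correlation coefficient, such as the Pearson correlation coefficient:

$r_{ij} = \frac{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)(x_{jl} - \bar{x}_j)}{\sqrt{\sum_{l=1}^{d} (x_{il} - \bar{x}_i)^2 \sum_{l=1}^{d} (x_{jl} - \bar{x}_j)^2}}$

where $\bar{x}_i = \frac{1}{d} \sum_{l=1}^{d} x_{il}$

• The correlation coefficient is in the range [-1, 1], so the distance measure is defined as:

$D(x_i, x_j) = \frac{1 - r_{ij}}{2}$

• This measure reveals differences in shape rather than the magnitude of the differences between the two objects.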
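A short sketch of a correlation-based distance, assuming the common mapping of the coefficient from [-1, 1] to [0, 1] via (1 − r)/2. Note that the mean is taken over the features of a single object, not over objects; the function name is illustrative.

```python
import math


def pearson_distance(x, y):
    """Distance in [0, 1] derived from the Pearson correlation of two objects."""
    d = len(x)
    mx, my = sum(x) / d, sum(y) / d          # per-object means over the d features
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    r = num / den
    return (1.0 - r) / 2.0
```

The objects (1, 2, 3) and (10, 20, 30) have the same shape, so their distance is 0 despite the tenfold difference in magnitude, while (1, 2, 3) and (3, 2, 1) are at the maximum distance of 1.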


• One of the most widely used similarity measures for continuous variables is the cosine similarity:

$S(x_i, x_j) = \cos \alpha = \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|}$

• If two objects are similar, they will be more parallel in the feature space, and the cosine value will be greater.

• Like the Pearson correlation coefficient, this measure is unable to provide information on the magnitude of differences

• Cosine similarity can be constructed as a distance measure by simply using D(xi,xj) = 1 – S(xi,xj)
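As an illustration, cosine similarity and the derived distance can be written as follows, assuming non-zero numeric vectors; the function names are illustrative.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def cosine_distance(x, y):
    """Distance obtained as one minus the cosine similarity."""
    return 1.0 - cosine_similarity(x, y)
```

Parallel vectors such as (1, 2) and (2, 4) have similarity 1 regardless of their lengths, and orthogonal vectors have similarity 0.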


• Similarity measures are most commonly used for discrete features, many of which take only two values (binary features)

• Binary features can be classified as:

• Symmetric : both values are equally significant

• Asymmetric : one value (often represented as 1) is more important than the other

• Examples:

• The feature of gender is symmetric (“female” and “male” can be encoded as 1 or 0 without affecting the evaluation of the similarity)

• The presence/absence of a rare form of a gene can be asymmetric (the presence may be more important than the absence)


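• Similarity measures for symmetric binary variables are:

$S_{ij} = \frac{n_{11} + n_{00}}{n_{11} + n_{00} + w (n_{10} + n_{01})}$

where $n_{11}$ and $n_{00}$ are the numbers of features on which both objects take the value 1 or both take the value 0, and $n_{10}$ and $n_{01}$ count the features on which only one of the two objects takes the value 1

• Different values of w have been proposed:

• w = 1, simple matching coefficient

• w = 2, Rogers and Tanimoto

• w = 1/2, Sokal and Sneath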

• These invariant similarity measures regard the 1-1 match and 0-0 match as equally important

• The corresponding dissimilarity measure is known as Hamming distance
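The weighted family of symmetric measures can be sketched as below, assuming objects are equal-length sequences of 0/1 values; the function names are illustrative.

```python
def binary_counts(x, y):
    """Return (n11, n00, n10, n01) for two equal-length 0/1 sequences."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    return n11, n00, n10, n01


def symmetric_binary_similarity(x, y, w=1.0):
    """w = 1: simple matching; w = 2: Rogers and Tanimoto; w = 1/2: Sokal and Sneath."""
    n11, n00, n10, n01 = binary_counts(x, y)
    return (n11 + n00) / (n11 + n00 + w * (n10 + n01))
```

With w = 1, the quantity d·(1 − S) equals the number of mismatching positions, that is, the Hamming distance between the two binary vectors.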


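• Similarity measures for asymmetric binary variables are:

$S_{ij} = \frac{n_{11}}{n_{11} + w (n_{10} + n_{01})}$

where $n_{11}$ is the number of features on which both objects take the value 1, and $n_{10}$ and $n_{01}$ count the features on which only one of the two objects takes the value 1

• Different values of w have been proposed:

• w = 1, Jaccard coefficient

• w = 2, Sokal and Sneath

• w = 1/2, Dice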

• These non-invariant similarity measures focus on the 1-1 match while ignoring the effect of 0-0 match, which is considered uninformative
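A corresponding sketch for the asymmetric case, again assuming equal-length 0/1 sequences. The function name is illustrative, and raising an error when the two objects share no 1 and have no mismatch is a choice of this sketch, since the ratio is then undefined.

```python
def asymmetric_binary_similarity(x, y, w=1.0):
    """w = 1: Jaccard; w = 2: Sokal and Sneath; w = 1/2: Dice."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)   # n10 + n01
    if n11 + mismatches == 0:
        raise ValueError("similarity is undefined when both objects are all zeros")
    return n11 / (n11 + w * mismatches)
```

Appending extra positions where both objects are 0 leaves the result unchanged, which is exactly the sense in which the 0-0 match is ignored.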


• For discrete features that have more than two states, a simple and direct strategy is to map them into a larger number of new binary features

• These new binary features are asymmetric

• This may introduce too many binary variables

• Are there other strategies?


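• Simple Matching Criterion:

$S_{ij} = \frac{1}{d} \sum_{l=1}^{d} S_{ijl}$

where $S_{ijl} = 1$ if objects i and j match in the lth feature, and $S_{ijl} = 0$ otherwise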
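The simple matching criterion for multi-state categorical features amounts to the fraction of features on which two objects agree. A minimal sketch, with an illustrative function name and made-up category labels:

```python
def simple_matching(x, y):
    """Fraction of features on which two categorical objects take the same value."""
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)
```

For instance, `simple_matching(["red", "small", "round"], ["red", "large", "round"])` counts two matches out of three features.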


• Sometimes, categorical features display a certain order

• The codes from 1 to Ml (where Ml is the number of levels or the highest level for the feature l) are no longer meaningless

• The closer two levels are, the more similar the two objects will be with regard to that feature

• Since the number of possible levels varies with the features, the levels are converted to the range [0,1]:

$x^*_{il} = \frac{x_{il} - 1}{M_l - 1}$

where $x_{il} \in \{1, \dots, M_l\}$ is the level code of object i for feature l
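One common way to carry out this conversion is to map level r out of M to (r − 1)/(M − 1). The sketch below assumes this linear mapping and a caller-supplied ordering of the level labels; the function name and the example labels are hypothetical.

```python
def encode_ordinal(value, ordered_levels):
    """Map an ordinal label to [0, 1], given its levels from lowest to highest."""
    rank = ordered_levels.index(value) + 1           # codes run from 1 to M
    return (rank - 1) / (len(ordered_levels) - 1)
```

After this conversion, features with different numbers of levels become comparable: the lowest level always maps to 0 and the highest to 1.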


• In real data sets it is common to see both continuous and categorical features at the same time

• Generally, we can map all features into the interval [0,1] and apply measures such as the Euclidean distance

• But this is infeasible for categorical variables whose classes are just names without any meaning

• We can transform all features into binary variables and only use binary similarity functions, but this leads to information loss


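• A more powerful method was proposed by Gower:

$S_{ij} = \frac{\sum_{l=1}^{d} \delta_{ijl} S_{ijl}}{\sum_{l=1}^{d} \delta_{ijl}}$

where $S_{ijl}$ indicates the similarity for the lth feature, and $\delta_{ijl}$ is a 0-1 coefficient that tracks whether a feature is missing ($\delta_{ijl} = 0$ if the lth feature is missing for either object, 1 otherwise)

• For discrete variables: $S_{ijl} = 1$ if $x_{il} = x_{jl}$, and $S_{ijl} = 0$ otherwise

• For continuous variables: $S_{ijl} = 1 - \frac{|x_{il} - x_{jl}|}{R_l}$

• where $R_l$ is the range of the lth variable (max – min)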
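A minimal sketch of a Gower-style similarity for mixed data follows. The assumptions are: missing values are marked with `None`, each feature is tagged as `"continuous"` or `"discrete"` by the caller, and the range of each continuous feature is supplied in advance. The function name and argument layout are illustrative, not a reference implementation.

```python
def gower_similarity(x, y, kinds, ranges):
    """Average per-feature similarity over the features observed in both objects.

    kinds[l] is "continuous" or "discrete"; ranges[l] is max - min of feature l
    (only used for continuous features).
    """
    numerator, usable = 0.0, 0
    for a, b, kind, r in zip(x, y, kinds, ranges):
        if a is None or b is None:
            continue                       # coefficient is 0: feature is skipped
        if kind == "continuous":
            s = 1.0 - abs(a - b) / r       # range-normalised closeness
        else:
            s = 1.0 if a == b else 0.0     # exact match on discrete features
        numerator += s
        usable += 1
    if usable == 0:
        raise ValueError("no feature is observed in both objects")
    return numerator / usable
```

For example, two objects with ages 30 and 40 (range 50), the same colour, and one missing third feature get a similarity of (0.8 + 1) / 2 = 0.9.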