Classification and Prediction
(Data Mining: Concepts and Techniques)
Chapter 6 Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Other classification methods
- Prediction
- Performance evaluation
- Summary
An Example of Classification (Fruit Classifier)

[Figure: a classifier maps input features to an output class label. E.g., input features shape = round, color = red give "Apple"; shape = round, color = orange give "Orange"; shape = oval, color = yellow give "Mango".]
A Graphical Model for Classifier

[Figure: input features x1, x2, …, xn feed into the classifier, which outputs y, the class label.]
Model Representation for Classifier

The model is of the form y = f (x1, x2, …, xn).

[Figure: the classifier maps input features x1, x2, …, xn to the output y, the class label.]

Main problems in classifier model construction:
- What are x1, …, xn for constructing an effective f?
- How to get the model f given x1, …, xn?
- How to collect training data with class label y for creating the model f?
Main Data Mining Techniques: Supervised Learning

Use a training set to construct a model for the outcome forecast of future events. Two main types:
- Classification: constructing models that distinguish classes for future forecasts.
  Applications: loan approval, customer classification, fingerprint recognition.
  Model representation: decision tree, neural network.
- Prediction: constructing models that predict unknown numerical values.
  Applications: price prediction of various securities and assets.
  Model representation: neural network, linear regression.
Classification vs. Prediction

Use a training set to construct a model for the outcome forecast of future events.
- Classification: predicts categorical class labels; constructs a classification model to classify new data.
- Prediction: predicts numerical values; constructs a continuous-valued function to predict unknown or missing values.

Typical applications: credit card approval, medical diagnosis and treatment, pattern recognition.
Classification vs. Prediction

[Figure: a classifier maps input features to an output class label (a category/nominal value); a predictor maps input features to an output predicted value (a continuous value).]
Classification—A Two-Step Process

1. Model construction
   - Training set: the data set used for model construction.
   - Class label: each tuple/sample is assumed to belong to a predefined class (determined by the class label attribute).
   - Model representation: classification rules, decision trees, or mathematical formulae.
2. Model usage: for classifying future unknown objects
   - Test set: a data set independent of the training set.
   - Performance evaluation: to evaluate how good the model is. The known class label (y) of each test sample is compared with the classified result (y') from the classification model.
   - Accuracy rate: the percentage of test samples that are correctly classified by the model (the ratio of y = y').
Classification—A Two-Step Process

[Figure: Model construction: training data (I, O) feed the classification learning algorithms, which produce the classifier model. Model usage: the classifier model maps input features to the output class label.]
Classification Process (1): Example for Model Construction

Training data:

name  rank            years  tenured
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

rank and years are the input features; tenured is the class label. Feeding the training data to the classification learning algorithms yields the classifier (model), tenured = f (rank, years), e.g.

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification Process (2): Example for Model Use

Testing data:

name     rank            years  tenured
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

The testing data drive the performance evaluation (accuracy) of the classifier. Use of the model on unseen data: given (Jeff, Professor, 4), predict tenured?
Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  Aim: establish a classifier model.
  Supervision: the training data (observations, measurements, etc.) are accompanied by class labels indicating the class of the observations.
- Unsupervised learning (clustering)
  The class labels of the training data are unknown.
  Aim: establish the classes or clusters of the data.
Chapter 6 Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Estimating classification accuracy
- Summary
Issues regarding classification & prediction

1) Data preparation (data preprocessing)
- Data cleaning: preprocess data in order to reduce noise and handle missing values.
- Feature relevance analysis (feature selection): remove irrelevant or redundant attributes.
- Data transformation: generalize and/or normalize data.
Issues regarding classification and prediction

2) Performance evaluation of classification methods
- Predictive accuracy
- Speed scalability (for big data analysis): time to construct the model; time to use the model.
- Space scalability (for big data analysis): memory/disk required to construct/use the model.
- Robustness: handling noise and missing values.
- Interpretability: understanding and insight provided by the model.
- Goodness of rules: size and compactness of classification rules.
Chapter 6 Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Estimating classification accuracy
- Summary
Decision Tree Induction Algorithm (A Learning Algorithm for the Classification Model)

Decision tree induction
- Given: a set of training data (<I, O> = <x1, …, xn, y>).
- Aim: find f, where y = f (x1, x2, …, xn) and f is in decision-tree form; i.e., construct a minimal decision tree in order to effectively classify future unknown samples.

Decision tree representation: a flow-chart-like tree structure.
- An internal node denotes a test on an attribute of a sample.
- A branch represents an outcome of the test.
- A leaf node represents a class label or a class distribution.
Classification in Decision Tree Induction

1. Generation of the decision tree, consisting of two phases:
   - Tree construction: the tree is constructed node by node in a top-down manner using the training examples.
   - Tree pruning: identify and remove branching subtrees that reflect noise or outliers.
2. Use of the decision tree: classify an unknown sample by testing its attribute values (the class label being unknown) against the decision tree.
An Example of a Training Dataset (for buys_PC)

age     income  student  credit_rating  buys_PC
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

age, income, student, and credit_rating are the input features; buys_PC is the class label. This follows an example from Quinlan's ID3.
A Decision Tree for Predicting buys_PC

Buy_PC = f (age, student, credit rating)

[Decision tree; internal nodes are tests on input attributes, branches are attribute values, leaves are class labels for Buy_PC:]

age?
  <=30   -> student?
              no  -> no
              yes -> yes
  30..40 -> yes
  >40    -> credit rating?
              excellent -> no
              fair      -> yes
Extracting Classification Rules from Trees

- Rules are easier for humans to understand.
- Represent the knowledge in the form of IF-THEN rules: one rule is created for each path from the root to a leaf; each attribute-value pair along a path forms a condition; the leaf node holds the class prediction.

Examples of extracted rules (one per path of the tree above; note that, per the tree and the training data, age > 40 with excellent credit gives "no" and with fair credit gives "yes"):

IF age = "<=30" AND student = "no"  THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40"                    THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair"      THEN buys_computer = "yes"
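As a quick illustration, rules extracted this way translate directly into code. A minimal sketch (Python; the function name and value spellings are my own, following the corrected rules above):

```python
def classify_buys_computer(age, student, credit_rating):
    """Apply the IF-THEN rules extracted from the decision tree.

    age is one of "<=30", "31...40", ">40"; student is "yes"/"no";
    credit_rating is "fair"/"excellent".
    """
    if age == "<=30" and student == "no":
        return "no"
    if age == "<=30" and student == "yes":
        return "yes"
    if age == "31...40":
        return "yes"
    if age == ">40" and credit_rating == "excellent":
        return "no"
    if age == ">40" and credit_rating == "fair":
        return "yes"
    raise ValueError("unrecognized attribute values")

print(classify_buys_computer("<=30", "yes", "fair"))  # -> yes
```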
ID3 Algorithm for Decision Tree Induction

Assumption: attributes are categorical; if continuous-valued, they are discretized in advance.

Idea of the decision tree induction algorithm:
- The tree is constructed node by node in a top-down, recursive, divide-and-conquer manner.
- Key: find the most discriminating attribute at each node.
- At the start, all training samples are located at the root node.
- At each node, the training samples at that node are used to select the most discriminating attribute on the basis of a heuristic or statistical measure (attribute selection measure).
- The training samples are then partitioned into the node's branches based on the most discriminating attribute and its associated data values.
Partitioning of Training Data at a Node of a Decision Tree

age, income, student, and credit_rating are tested; age is the most discriminating attribute. Splitting the full training set (shown earlier) on age? produces three branches:

Branch age <=30:
age    income  student  credit_rating  buys_PC
<=30   high    no       fair           no
<=30   high    no       excellent      no
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
<=30   medium  yes      excellent      yes

Branch age 31…40:
age    income  student  credit_rating  buys_PC
31…40  high    no       fair           yes
31…40  low     yes      excellent      yes
31…40  medium  no       excellent      yes
31…40  high    yes      fair           yes

Branch age >40:
age    income  student  credit_rating  buys_PC
>40    medium  no       fair           yes
>40    low     yes      fair           yes
>40    low     yes      excellent      no
>40    medium  yes      fair           yes
>40    medium  no       excellent      no
Stopping Conditions (for Decision Tree Induction)

Conditions for stopping recursive data partitioning:
- All samples at a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
- There are no samples left.
Attribute Selection Measure (Find the Most Discriminating Attribute at Each Node)

- Information gain (ID3/C4.5)
  All attributes are assumed to be categorical; can be modified for continuous-valued attributes.
- Gini index (IBM IntelligentMiner)
  All attributes are assumed continuous-valued; assumes there exist several possible split values for each attribute; may need other tools, such as clustering, to get the possible split values; can be modified for categorical attributes.
Information Gain (of ID3/C4.5) (A Measure for Attribute Selection)

How do we find the most discriminating attribute for each node? Idea: find the attribute with the highest information gain at each node.

Assume there are two classes, P and N, in the training examples, and let the set of training examples S contain p elements of class P and n elements of class N. The information amount (entropy) needed to decide whether an arbitrary example in S belongs to P or N is defined as

$$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
Information Gain (in Decision Tree Induction)

Assume that using attribute $A_k$, the set S will be partitioned into sets {S1, S2, …, Sv} (i.e., the number of $A_k$'s values is v). If $S_i$ contains $p_i$ examples of P and $n_i$ examples of N, the entropy (the expected information amount needed to classify objects in all subtrees of S) is

$$E(A_k) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$

The information amount gained by branching on $A_k$ is

$$Gain(A_k) = I(p, n) - E(A_k)$$

Find the attribute $A_i$ with maximal gain for this node, i.e., the $A_i$ such that $Gain(A_i) \ge Gain(A_j)$ for $1 \le j \le m,\ j \ne i$.
An Example (Attribute Selection by Information Gain)

Data: class P: buys_computer = "yes"; class N: buys_computer = "no". At the root node,

$$I(p, n) = I(9, 5) = 0.940$$

Compute the entropies for the attributes age, income, student, and credit_rating at the root node. Splitting the data by age gives:

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
30…40   4    0    0
>40     3    2    0.971

The total split entropy for age (the entropy after splitting using age) is

$$E(age) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.69$$

so the information gain for age is

$$Gain(age) = I(p, n) - E(age) = 0.25$$

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Therefore, attribute age is selected at the root node.
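A minimal sketch of this computation on the buys_PC table (Python; function names are my own):

```python
import math
from collections import Counter

# The buys_PC training set: (age, income, student, credit_rating, buys_PC)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def entropy(labels):
    """I(p, n), generalized to any multiset of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(rows, attr_index):
    """Gain(A_k) = I(p, n) - E(A_k) for the attribute at attr_index."""
    labels = [r[-1] for r in rows]
    split = {}
    for r in rows:
        split.setdefault(r[attr_index], []).append(r[-1])
    e = sum(len(subset) / len(rows) * entropy(subset)
            for subset in split.values())
    return entropy(labels) - e

for i, name in enumerate(["age", "income", "student", "credit_rating"]):
    print(f"Gain({name}) = {info_gain(data, i):.3f}")
# age has the highest gain (~0.247), so it is selected at the root.
```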
Gini Index (IBM IntelligentMiner)

If a data set T contains examples from n classes, the gini index gini(T) is defined as

$$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where $p_j$ is the data percentage of class j in T. If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively by using attribute $A_k$, the gini index after splitting, $gini_k(T)$, is defined as

$$gini_k(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)$$

The attribute providing the smallest $gini_k(T)$ is chosen to split the node (this requires enumerating all possible splitting points for attribute $A_k$).
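A small sketch of the gini computation (Python; the toy split is made up for illustration):

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class proportions in T."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """gini_k(T) = N1/N * gini(T1) + N2/N * gini(T2) for a binary split."""
    n1, n2 = len(left), len(right)
    n = n1 + n2
    return n1 / n * gini(left) + n2 / n * gini(right)

# Toy example: a 9-yes / 5-no node split into two candidate subsets.
print(gini(["yes"] * 9 + ["no"] * 5))        # gini before splitting
print(gini_split(["yes"] * 6 + ["no"] * 1,   # gini after splitting;
                 ["yes"] * 3 + ["no"] * 4))  # smaller is better
```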
Avoid Overfitting in ID3/C4.5

The generated tree may overfit the training data, resulting in poor accuracy for unseen samples: if too many tree branches exist, some may reflect anomalies due to noise or outliers. Two pruning approaches avoid overfitting:
- Prepruning: halt tree construction early, i.e., do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: get a sequence of progressively pruned trees from a "fully grown" tree, then use a set of data different from the training data to decide which is the "best pruned tree".
Enhancements to basic decision tree induction

- Allow for continuous-valued attributes: define new discrete-valued attributes by dynamically partitioning the continuous attribute values into a set of discrete intervals.
- Handle missing attribute values: by assigning the most common value of the attribute, or by assigning a probability to each of the possible values.
Why decision tree induction

Decision tree induction is a classification learning algorithm, and classification is a typical problem extensively studied by statisticians and machine learning researchers. Why use decision tree induction for classification?
- It is convertible to simple, easy-to-understand classification rules (if-then rules).
- SQL queries can be used to access databases for each rule, to find its associated data and rule coverage rate.
- Its classification accuracy is comparable with other methods.
- Its learning speed is relatively faster than other classification methods.
Presentation of Classification Results
Chapter 6 Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Estimating classification accuracy
- Summary
Bayesian Classification: Why?

- Probabilistic prediction and learning: predict multiple hypotheses and calculate a probability for each hypothesis.
- Incremental learning: each training example incrementally increases/decreases the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
- Standard: even when Bayesian methods are computationally intractable, they can provide a benchmark standard against other methods.

Given training data D, the posterior probability of a hypothesis h, denoted as P(h|D), follows the Bayes theorem, and the value of P(h|D) can be obtained from the values of P(h), P(D), and P(D|h).
Bayesian Theorem

$$P(h|D) = \frac{P(h \wedge D)}{P(D)}, \qquad P(h \wedge D) = P(D|h)\,P(h)$$

$$\Rightarrow\quad P(h|D) = \frac{P(D|h)\,P(h)}{P(D)}$$

[Figure: Venn diagram of the events D and h, with overlap h∧D.]
Naive Bayes Classification for m classes

Definitions:
- X = <x1, x2, …, xn>: an n-dimensional sample.
- Ci|X: the hypothesis "sample X is of class Ci".
- P(Ci|X): the probability that sample X is of class Ci. There are m such probabilities: P(C1|X), P(C2|X), …, P(Cm|X).

Goal: find f for D, where y = f (x1, x2, …, xn) and f is in naive-Bayes-classifier (NBC) form. f assigns sample X to class Ci if P(Ci|X) is maximal among P(C1|X), …, P(Cm|X); that is, the naive Bayesian classifier f assigns an unknown sample X to class Ci if and only if

$$P(C_i|X) > P(C_j|X) \quad \text{for } 1 \le j \le m,\ j \ne i$$

(X is most likely of class Ci according to the probabilities.)
Estimating a-posteriori probabilities

How do we find the maximal one among P(C1|X), P(C2|X), …, P(Cm|X)? According to the Bayes theorem:

P(Ci|X) = P(X|Ci)·P(Ci) / P(X) for 1 ≤ i ≤ m

P(X) is difficult to obtain, but it is a constant when comparing the P(Ci|X) values over all m classes. So only the values of P(X|Ci)·P(Ci) are required for 1 ≤ i ≤ m.

- P(Ci) = relative frequency of samples in class Ci, which can be computed from the training data.
- Remaining problem: how to compute P(X|Ci)?
Naïve Bayesian Classification

Naïve assumption: attribute independence, i.e.

P(X|Ci) = P(<x1, …, xn>|Ci) = P(x1|Ci)·…·P(xn|Ci)

Why this assumption? It makes P(X|Ci) computable: the exact sample X may never occur within class Ci of the training data, so P(X|Ci) is not directly computable, and the assumption requires only a minimal amount of training data.

- If the k-th attribute of X is categorical: P(xk|Ci) is estimated as the relative frequency of samples having value xk for the k-th attribute (Ak = xk) in class Ci, 1 ≤ k ≤ n.
- If the k-th attribute is continuous: P(xk|Ci) is estimated via a Gaussian density function (normal distribution, a function modeled by mean and variance) for the k-th attribute, using the data in class Ci.
Play-tennis example (Predict playing tennis or not on a given day)

Given the following training data, the problem is to predict whether to play tennis on a particular day.

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

P: play, 9 records; N: not play, 5 records. Given an unseen input sample <rain, hot, high, false>, will we play tennis?
Play-tennis example: classifying X

An unseen sample X = <rain, hot, high, false>; we want to predict "play tennis or not?"

1. The problem is to test whether P(p|X) > P(n|X), i.e., by the Bayesian theorem, whether P(X|p)·P(p) / P(X) > P(X|n)·P(n) / P(X).
2. According to naïve Bayesian classification, test whether P(X|p)·P(p) > P(X|n)·P(n), where
   P(X|p)·P(p) = P(<rain, hot, high, false>|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
   P(X|n)·P(n) = P(<rain, hot, high, false>|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
3. We need to know the values of the above probabilities.
Play-tennis example: estimating P(p), P(n), and P(xk|Ci)

Step 1: class priors from the training data (shown above): P(p) = 9/14, P(n) = 5/14.

Step 2: conditional probabilities, estimated as relative frequencies within each class,

$$P(x_k|C_i) = \frac{P(x_k \wedge C_i)}{P(C_i)}$$

outlook:      P(sunny|p) = 2/9     P(sunny|n) = 3/5
              P(overcast|p) = 4/9  P(overcast|n) = 0
              P(rain|p) = 3/9      P(rain|n) = 2/5
temperature:  P(hot|p) = 2/9       P(hot|n) = 2/5
              P(mild|p) = 4/9      P(mild|n) = 2/5
              P(cool|p) = 3/9      P(cool|n) = 1/5
humidity:     P(high|p) = 3/9      P(high|n) = 4/5
              P(normal|p) = 6/9    P(normal|n) = 2/5
windy:        P(true|p) = 3/9      P(true|n) = 3/5
              P(false|p) = 6/9     P(false|n) = 2/5

For X = <rain, hot, high, false>, test whether P(X|p)·P(p) > P(X|n)·P(n).
Play-tennis example: classifying X

An unseen sample X = <rain, hot, high, false>; we want to predict "play tennis or not?" The problem is to test whether P(p|X) > P(n|X), i.e., by the Bayesian theorem, whether P(X|p)·P(p) / P(X) > P(X|n)·P(n) / P(X). According to naïve Bayesian classification, we test whether P(X|p)·P(p) > P(X|n)·P(n):

P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286

Sample X is classified in class n (do not play), since P(X|p)·P(p) < P(X|n)·P(n).
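A compact sketch reproducing these numbers from the play-tennis table (Python; the names are my own):

```python
from collections import Counter, defaultdict

# Play-tennis training data: (outlook, temperature, humidity, windy) -> class
data = [
    (("sunny", "hot", "high", "false"), "N"),
    (("sunny", "hot", "high", "true"), "N"),
    (("overcast", "hot", "high", "false"), "P"),
    (("rain", "mild", "high", "false"), "P"),
    (("rain", "cool", "normal", "false"), "P"),
    (("rain", "cool", "normal", "true"), "N"),
    (("overcast", "cool", "normal", "true"), "P"),
    (("sunny", "mild", "high", "false"), "N"),
    (("sunny", "cool", "normal", "false"), "P"),
    (("rain", "mild", "normal", "false"), "P"),
    (("sunny", "mild", "normal", "true"), "P"),
    (("overcast", "mild", "high", "true"), "P"),
    (("overcast", "hot", "normal", "false"), "P"),
    (("rain", "mild", "high", "true"), "N"),
]

class_counts = Counter(label for _, label in data)   # for P(C_i)
cond = defaultdict(Counter)                          # counts of (attr k = x_k) in C_i
for features, label in data:
    for k, value in enumerate(features):
        cond[(k, label)][value] += 1

def score(features, label):
    """P(X|C_i) * P(C_i) under the attribute-independence assumption."""
    s = class_counts[label] / len(data)
    for k, value in enumerate(features):
        s *= cond[(k, label)][value] / class_counts[label]
    return s

x = ("rain", "hot", "high", "false")
print(score(x, "P"))  # ~0.010582
print(score(x, "N"))  # ~0.018286 -> classify X as N (do not play)
```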
The Independence Assumption

- Makes computation possible.
- Yields optimal classifiers when the assumption is satisfied.
- But it is seldom satisfied in practice, as attributes (variables) are often correlated.
- One can attempt to overcome this limitation with Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes (though the sample size must be large enough compared to NBC).
Bayesian Belief Networks (I)

[Figure: a belief network over the variables FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea. Assumption: the variables FamilyHistory and Smoker are correlated (both influence LungCancer).]

The conditional probability table for the variable LungCancer:

        (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC      0.8      0.5       0.7       0.1
~LC     0.2      0.5       0.3       0.9
Bayesian Belief Networks (II)

- A Bayesian belief network allows class conditional independencies between subsets of the variables.
- It is a graphical model of causal relationships.
- Several classes of problems arise in learning Bayesian belief networks:
  - Given the network structure of all related variables: easy.
  - Given the network structure of only some related variables: hard.
  - When the network structure is unknown: harder.
Chapter 6 Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Discriminative classification
- Other classification methods
- Prediction
- Estimating classification accuracy
- Summary
Classification: A Mathematical Mapping

Binary classification, as a mathematical mapping, computes 2-class categorical labels; it is a binary function f: X → Y, i.e.

$$y = f(X), \quad X \in \mathcal{X} \equiv \mathbb{R}^n, \quad y \in Y = \{+1, -1\} \ (\text{or } \{0, 1\})$$

with X the input and y the output.

Example: classification of personal homepages or SPAM mail (an application of automatic document classification):
- y = +1 or −1 (1/0; yes/no; true/false; positive/negative)
- X = <x1, x2, …, xn>, a keyword frequency vector for a Web page:
  x1: count of keyword 1, e.g., "homepage"; x2: count of keyword 2, e.g., "welcome"; …
Linear Classification Problems

Example: a 2-D binary classification problem, where a sample is a 2-D point.

[Figure: points of class 'x' scattered above a red line and points of class 'o' below it.]

- The data above the red line belong to class 'x'; the data below the red line belong to class 'o'.
- The data can be linearly classified by the red line.
- Classifier examples: SVM, perceptron (an ANN).

In linear classification problems, the classification is accomplished by a linear hyperplane, e.g.

$$ax + by + c = 0, \qquad \text{or in weight notation} \qquad w_1 x_1 + w_2 x_2 + w_0 = 0$$
Chapter 6 Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Other classification methods
- Prediction
- Estimating classification accuracy
- Summary
Artificial Neural Networks (A Network of Artificial Neurons)

Advantages:
- Prediction accuracy is generally high.
- Robust: works even when training examples contain errors.
- Output may be discrete, real-valued, or a vector of discrete or real-valued attributes.
- Fast evaluation of the learned target function.

Criticism:
- (Somewhat) long training time for an optimal model.
- Difficult to understand the learned function (weights).
- Not easy to incorporate prior domain knowledge.
Architecture of a Typical Artificial Neural Network (Multi-layer Perceptron)

[Figure: input signals enter the Input Layer, propagate through the Middle Layer, and leave the Output Layer as output signals.]
A Neuron as a Simple Computing Element

[Figure: a neuron receives input signals x1, x2, …, xn through connection weights w1, w2, …, wn, has threshold (bias) θ, and emits output signal Y.]

$$Y = f\Big(\sum_{i=1}^{n} x_i w_i - \theta\Big)$$

where the activation function f gives the neuron its functional meaning.
A Neuron as a Simple Computing Element

Steps of a neuron's computation:
1. Compute the weighted sum of the input signals.
2. Compare the result with a threshold value θ.
3. Produce an output based on a transfer or activation function f, as follows:

$$Y = f\Big(\sum_{i=1}^{n} x_i w_i - \theta\Big)$$
Various Activation Functions f of a Neuron

[Figure: plots of the step, sign, sigmoid, and linear functions on Y-vs-X axes, each ranging between −1 and +1 where applicable. The step and sign functions are typical of hidden and output neurons; the linear function is used in output neurons for function approximation.]

$$Y^{step} = \begin{cases} 1, & \text{if } X \ge 0 \\ 0, & \text{if } X < 0 \end{cases} \qquad Y^{sign} = \begin{cases} +1, & \text{if } X \ge 0 \\ -1, & \text{if } X < 0 \end{cases}$$

$$Y^{sigmoid} = \frac{1}{1 + e^{-X}} \qquad Y^{linear} = X$$

Here $Y = f(X)$ with $X = \sum_{i=1}^{n} x_i w_i$.
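A small sketch of the four activation functions (Python with NumPy, assumed available):

```python
import numpy as np

def step(x):
    return np.where(x >= 0, 1.0, 0.0)

def sign(x):
    return np.where(x >= 0, 1.0, -1.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def linear(x):
    return x

# Evaluate each activation function on a few sample net inputs X.
x = np.linspace(-3, 3, 7)
for f in (step, sign, sigmoid, linear):
    print(f.__name__, np.round(f(x), 3))
```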
Construction of the Classification Model via Network Training

The objective of network training: obtain a set of connection weights that makes almost all the training tuples classified correctly.

Steps:
1. Initialize the weights with random values.
2. Feed one of the training samples into the network.
3. For each neuron, layer by layer:
   a. Compute the net input to the neuron as a weighted summation of all the inputs to the neuron.
   b. Compute the output value using the activation function.
   c. Compute the error by the backpropagation algorithm.
   d. Adjust the weights and the bias according to the error.
4. Go to Step 2 until convergence.

[Figure: the multi-layer network, with a training loop repeatedly feeding input signals through the input, middle, and output layers.]
Backpropagation Training Algorithm

[Figure: the input vector enters the input nodes, is propagated forward through the hidden nodes to the output nodes, and produces the output vector; errors are then propagated backward to update the weights.]

I) The input is propagated forward. For the j-th hidden-layer neuron (I: input, O: output):

$$I_j^H = \sum_i w_{ji}^H O_i^I, \qquad O_j^H = \frac{1}{1 + e^{-I_j^H}}$$

and for the k-th output neuron:

$$I_k^O = \sum_j w_{kj}^O O_j^H, \qquad O_k^O = \frac{1}{1 + e^{-I_k^O}}$$

II) The weights are updated according to the backward-propagated errors. With target output $T_k$ and learning rate $\eta$:

$$Err_k^O = O_k^O (1 - O_k^O)(T_k - O_k^O)$$

$$Err_j^H = O_j^H (1 - O_j^H) \sum_k Err_k^O\, w_{kj}^O$$

$$w_{kj}^{O,new} = w_{kj}^O + \eta\, Err_k^O O_j^H, \qquad w_{ji}^{H,new} = w_{ji}^H + \eta\, Err_j^H O_i^I, \qquad \theta_j^{H,new} = \theta_j^H + \eta\, Err_j^H$$
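A minimal sketch of one forward/backward step following these update rules, for a tiny 2-2-1 network (Python with NumPy; the biases θ are omitted for brevity, and the architecture and data are made-up assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny 2-input, 2-hidden, 1-output network; eta is the learning rate.
W_h = rng.normal(size=(2, 2))   # hidden-layer weights w_ji^H
W_o = rng.normal(size=(1, 2))   # output-layer weights w_kj^O
eta = 0.5

def train_step(x, t):
    global W_h, W_o
    # I) forward propagation
    o_h = sigmoid(W_h @ x)                        # O_j^H
    o_o = sigmoid(W_o @ o_h)                      # O_k^O
    # II) backward error propagation
    err_o = o_o * (1 - o_o) * (t - o_o)           # Err_k^O
    err_h = o_h * (1 - o_h) * (W_o.T @ err_o)     # Err_j^H
    W_o += eta * np.outer(err_o, o_h)             # w_kj^O update
    W_h += eta * np.outer(err_h, x)               # w_ji^H update
    return o_o

# Repeating the step on one sample moves the output toward the target.
for _ in range(5):
    print(train_step(np.array([1.0, 0.0]), np.array([1.0])))
```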
Chapter 6 Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Other classification methods
- Prediction
- Estimating classification accuracy
- Summary
Other Classification Methods

- SVM—support vector machines
- k-nearest neighbor classifier
- Case-based reasoning
- Rough set approach
- Fuzzy set approaches
SVM—Support Vector Machines

- A classification method for both linear and nonlinear data.
- For nonlinear data, a nonlinear mapping is used to transform the training data into a higher dimension.
- In the new dimension, it searches for the optimal linear separating hyperplane (i.e., "decision boundary").
- With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a linear hyperplane.
- SVM finds this separating hyperplane using support vectors ("essential" training tuples) and margins (the margin width, defined by the support vectors).
SVM—History and Applications

- Vapnik and colleagues (1992): groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s.
- Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization).
- Used for both classification and prediction.
- Applications: handwritten digit recognition, object recognition, speaker identification.
SVM — General Concept (Find a Decision Boundary with a Maximal Margin)

[Figure: two candidate decision boundaries separating the same two classes; decision boundary 2 is better than decision boundary 1 because its margin is larger.]
SVM — Margins and Support Vectors

[Figure: the same data separated with a small margin and with a large margin; the small-margin hyperplane rests on worse support vectors, the large-margin hyperplane on better ones.]
SVM — Case 1: When the Data Are Linearly Separable

Let the data D = {(X1, y1), …, (X|D|, y|D|)} be the set of training tuples, where each Xi is associated with the class label yi.
- There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data).
- SVM searches for the separating hyperplane with the largest margin m, i.e., the maximum marginal hyperplane (MMH).
SVM – Case 1: Linearly Separable

A separating hyperplane can be written as

$$W \cdot X + b = 0$$

where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias). For a 2-D space, a line L: ax + by + c = 0 can be written as

$$w_0 + w_1 x_1 + w_2 x_2 = 0$$

The hyperplanes defining the two sides of the margin are

$$H_1:\ w_0 + w_1 x_1 + w_2 x_2 \ge 1 \ \text{ for } y_i = +1, \qquad H_2:\ w_0 + w_1 x_1 + w_2 x_2 \le -1 \ \text{ for } y_i = -1$$

Support vectors: any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin). Finding the MMH becomes a constrained (convex) quadratic optimization problem (linear constraints, quadratic objective function), solved by quadratic programming (QP) with Lagrangian multipliers.
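A hedged sketch of a linear SVM on toy 2-D points, assuming scikit-learn is available (the data are made up; a large C approximates the hard-margin case described here):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable clusters of 2-D points.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6)   # large C ~ hard margin
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # W and b of the hyperplane W.X + b = 0
print(clf.support_vectors_)         # the "essential" training tuples
print(clf.predict([[3, 3]]))        # classify an unseen point
```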
Why Is SVM Effective on High-Dimensional Data?

- The complexity of an SVM classifier is characterized by the number of support vectors rather than by the dimensionality or number of the data.
- The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH).
- If all other training examples were removed and the SVM training repeated, the same separating hyperplane would be found.
- The set of support vectors can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality.
- Thus, an SVM with a small number of support vectors can still generalize well, even when the dimensionality of the data is high.
SVM — Case 2: Linearly Inseparable

Transform the original input data into a higher-dimensional space, then search for a linear separating hyperplane in the new space. For example, a 3-D input vector <x1, x2, x3> can be mapped into a new 6-D space <z1, z2, z3, z4, z5, z6>, where z1 = x1, z2 = x2, z3 = x3 and the remaining coordinates are nonlinear combinations of the original attributes.

[Figure: classes A1 and A2, inseparable in the original space, become linearly separable after the mapping.]
k-NN (k-Nearest Neighbor) Algorithm

A sample X is represented as $X = \langle x_1, x_2, \ldots, x_n \rangle$, so a training sample corresponds to a point in an n-D space. The nearest neighbors are defined in terms of Euclidean distance:

$$D(X, Y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

When given an unknown sample $x_q$, a k-NN classifier searches the sample space for the k training samples nearest to $x_q$, then decides the class of the unknown sample by majority vote. The value of k is decided heuristically.

[Figure: a query point $x_q$ surrounded by "+" and "−" training points; its class is the majority class among its k nearest neighbors.]
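A minimal sketch of k-NN with Euclidean distance and majority vote (Python with NumPy; the toy points are made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_q, k=3):
    """Majority vote among the k training samples nearest to x_q."""
    d = np.sqrt(((X_train - x_q) ** 2).sum(axis=1))  # Euclidean D(X_i, x_q)
    nearest = np.argsort(d)[:k]                      # indices of k nearest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = ["-", "-", "-", "+", "+", "+"]
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))  # -> "+"
```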
Discussion on the k-NN Algorithm

- The k-NN algorithm works only for numeric-valued data.
- Enhancement: the distance-weighted k-NN algorithm weights the contribution of the k neighbors according to their distance to the query point (sample) $x_q$, giving greater weight to closer neighbors:

$$w \equiv \frac{1}{D(x_q, x_i)^2}$$

  It similarly works only for numeric-valued data.
- Robust to noisy data, by averaging over the k nearest neighbors.
- Curse of dimensionality: the distance between neighbors can be dominated by many irrelevant attributes. To overcome it, stretch the axes or eliminate the least relevant attributes.
Rough Set Approach

- Rough sets are used to approximately, or "roughly", define equivalence classes.
- A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C).
- Rough sets can also be used for feature reduction: a discernibility matrix is used to detect redundant attributes.
Fuzzy Set Approaches

Example application: credit approval. Consider the credit approval rule:

IF (years_employed >= 2) AND (income >= 50K) THEN Credit = "approval"

Problem: what if a customer has had a job for at least two years but her income is $49K? Should she be approved or not? Solution: fuzzy set approaches, e.g.

IF (years_employed is medium) AND (income is high) THEN Credit is approval
Fuzzy Set Approaches

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership. Attribute values are converted to fuzzy values; e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy membership values

$$x \mapsto (\mu_l, \mu_m, \mu_h), \qquad \mu_l, \mu_m, \mu_h \in [0, 1]$$

[Figure: fuzzy membership graph for the income categories.]

For income, 49K is transformed into $(\mu_l, \mu_m, \mu_h) = (0, 0.1, 0.9)$. Each applicable rule of the rule set contributes a vote for membership in the categories; typically, the truth values for each predicted category are summed up with weights for making the decision.
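A toy sketch of such a fuzzy conversion (Python; the membership breakpoints are illustrative assumptions, chosen only so that 49K maps to roughly (0, 0.1, 0.9) as on the slide):

```python
def fuzzy_income(x):
    """Map an income x to (mu_low, mu_medium, mu_high) memberships in [0, 1].

    The breakpoints (30K, 40K, 50K) are hypothetical, not taken from
    the slide's membership graph.
    """
    low = max(0.0, min(1.0, (30_000 - x) / 10_000 + 1))
    med = max(0.0, 1 - abs(x - 40_000) / 10_000)
    high = max(0.0, min(1.0, (x - 40_000) / 10_000))
    return low, med, high

print(fuzzy_income(49_000))  # ~(0, 0.1, 0.9), as in the slide's example
```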
Chapter 6 Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Other classification methods
- Prediction
- Estimating classification accuracy
- Summary
What Is Prediction?

Prediction is similar to classification:
- Step 1: construct a model.
- Step 2: use the model to predict an unknown value.

The major method for prediction is regression:
- linear and multiple regression, e.g. $y = w_1 x_1 + w_2 x_2$
- non-linear regression, e.g. $y = w_1 x_1 + w_2 x_2^2 + w_3 x_3^3$

Another method: artificial neural networks. The main difference between prediction and classification: classification predicts a categorical class label, whereas prediction models continuous-valued functions.
Regression Analysis and Log-Linear Models in Prediction

Linear regression: $Y = \alpha + \beta X$. The two parameters α and β specify the line; they are estimated using the training samples, applying the least-squares criterion to (X1, Y1), (X2, Y2), …, (Xn, Yn).

Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_n X_n$. Many nonlinear functions can be transformed into the above.

Log-linear models, for example, estimating a probability:

$$p(a, b, c, d) = \alpha_{abc}\,\beta_{abd}\,\gamma_{acd}\,\delta_{bcd}$$

$$\log p(a, b, c, d) = \log \alpha_{abc} + \log \beta_{abd} + \log \gamma_{acd} + \log \delta_{bcd}$$
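A minimal sketch of least-squares estimation for these models (Python with NumPy; the training samples are made up):

```python
import numpy as np

# Least-squares fit of Y = alpha + beta * X on toy training samples.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()
print(alpha, beta)  # estimated line parameters

# Multiple regression via least squares: Y = b0 + b1*X1 + b2*X2,
# where X2 = X^2 illustrates transforming a nonlinear term.
A = np.column_stack([np.ones_like(X), X, X ** 2])
b, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(b)
```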
Chapter 6 Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Estimating classification accuracy
- Summary
Classifier Accuracy Measures

- Accuracy of a classifier M, acc(M): the percentage of test samples that are correctly classified by the classifier M (created from the training set).
- Error rate (misclassification rate) of M = 1 − acc(M).
- Given m classes, CMi,j, an entry in a confusion matrix, indicates the number of samples in class i that are labeled by the classifier as class j.

Example confusion matrix:

classes              yes (computed)  no (computed)  total   recognition (%)
buy_computer = yes   6954            46             7000    99.34
buy_computer = no    412             2588           3000    86.27
total                7366            2634           10000   95.42

Alternative performance measures (e.g., for cancer diagnosis), based on the two-class table

     P' (computed)    N' (computed)
P    true positive    false negative
N    false positive   true negative

sensitivity = TP/P            /* true positive recognition rate */
specificity = TN/N            /* true negative recognition rate */
precision   = TP/(TP + FP)
accuracy    = (TP + TN)/(P + N) = sensitivity · P/(P + N) + specificity · N/(P + N)

This model can also be used for cost-benefit analysis.
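The same measures computed from the confusion matrix above (Python):

```python
# Counts read off the buy_computer confusion matrix.
TP, FN = 6954, 46     # actual class: buy_computer = yes
FP, TN = 412, 2588    # actual class: buy_computer = no
P, N = TP + FN, FP + TN

sensitivity = TP / P               # 0.9934, true positive rate
specificity = TN / N               # 0.8627, true negative rate
precision = TP / (TP + FP)         # 0.9441
accuracy = (TP + TN) / (P + N)     # 0.9542
print(sensitivity, specificity, precision, accuracy)
```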
Error Measures for Prediction

Measure how far the predicted value is from the actual known value. A loss function measures the error between $y_i$ and $y_i'$ (the predicted value):
- Absolute error: $|y_i - y_i'|$
- Squared error: $(y_i - y_i')^2$

The test error is the average loss over the test set (of size d):

$$\text{MAE} = \frac{\sum_{i=1}^{d} |y_i - y_i'|}{d} \qquad \text{MSE} = \frac{\sum_{i=1}^{d} (y_i - y_i')^2}{d}$$

$$\text{RAE} = \frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|} \qquad \text{RSE} = \frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}$$

The mean squared error exaggerates the presence of outliers. The (square) root mean squared error is popularly used, and similarly the root relative squared error.
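A minimal sketch of these error measures (Python with NumPy; the test values are made up):

```python
import numpy as np

def error_measures(y, y_pred):
    """MAE, MSE, RAE, RSE over a test set, as defined above."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    mae = np.mean(np.abs(y - y_pred))
    mse = np.mean((y - y_pred) ** 2)
    rae = np.abs(y - y_pred).sum() / np.abs(y - y.mean()).sum()
    rse = ((y - y_pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return mae, mse, rae, rse

print(error_measures([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0]))
```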
Performance Evaluation of Classification (Methods for Estimating Average Classification Accuracy)

- Partition: training-and-testing. Use two independent data sets: a training set (2/3) and a test set (1/3). Used for data sets with a large number of samples.
- k-fold cross-validation. Randomly divide the data set into k subsets S1, S2, …, Sk. At iteration i, subset Si is used as the test set and the remaining k−1 subsets are used as training data; the accuracies of the k runs are averaged. Used for data sets of moderate size.
- Bootstrapping (leave-one-out). Similar to k-fold cross-validation with k set to s, where s is the number of initial samples. Used for small data sets.
10-fold Cross-Validation

[Figure: the data set is randomly divided into 10 subsets (1 … 10). In each iteration, nine of the ten subsets are used for training the classifier and the tenth subset is used as the test set; e.g., at iteration 10, subsets 1-9 form the training set and subset 10 the test set. The iterations repeat from 1 to 10.]
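A minimal sketch of the k-fold partitioning loop (Python with NumPy; the classifier training itself is left as a comment):

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly divide sample indices into k test subsets S1..Sk."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    return np.array_split(idx, k)

folds = k_fold_indices(100, k=10)
for i, test_idx in enumerate(folds, start=1):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i - 1])
    # Train the classifier on train_idx, evaluate on test_idx, then
    # average the k accuracies to estimate classification accuracy.
    print(f"iteration {i}: {len(train_idx)} train / {len(test_idx)} test")
```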
Model Comparison by ROC Curves

ROC (Receiver Operating Characteristics) curves allow visual comparison of the performance of classification models:
- The vertical axis represents the TP (true positive) rate.
- The horizontal axis represents the FP (false positive) rate.
- The plot also shows a diagonal line (the random "coin" model).

[Figure: ROC curves for model 1 and model 2 above the diagonal line; model 1 is better than model 2.]
Model Comparison by ROC Curves

- Originated from signal detection theory.
- Shows the trade-off between the true positive rate and the false positive rate.
- The area under the ROC curve (AUC) is a measure of the performance of the model: a model with perfect accuracy has an area of 1.0, and the closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model.
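A small sketch of AUC via the trapezoidal rule (Python with NumPy; the operating points are made up):

```python
import numpy as np

def auc(fpr, tpr):
    """Area under an ROC curve given (FP rate, TP rate) operating points."""
    order = np.argsort(fpr)
    return np.trapz(np.asarray(tpr)[order], np.asarray(fpr)[order])

print(auc([0.0, 0.1, 0.3, 1.0], [0.0, 0.6, 0.9, 1.0]))  # a good model, near 1.0
print(auc([0.0, 1.0], [0.0, 1.0]))                      # diagonal "coin" model -> 0.5
```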
Chapter 6 Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by backpropagation
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Estimating classification accuracy
- Summary
Summary

- Classification is an extensively studied problem (mainly in statistics, machine learning and AI).
- The classification issue: how to create a classifier, i.e., find f, where y = f (x1, x2, …, xn) and f is in DT, NBC, ANN, … form.
- Classification is probably one of the most widely used data mining techniques, with many extensions.
- Scalability is an important issue for applications related to big data and clouds.