Machine Learning Tutorial
-
7/30/2019 Machine Learning Tutorial
1/33
Machine Learning basic concepts
Machine Learning Tutorial for the UKP lab,
-
This ppt includes some slides/slide-parts/text taken
from online materials created by the following people:
- Greg Grudic
- Alexander Vezhnevets
- Hal Daumé III
-
The goal of machine learning is to build computer
systems that can adapt and learn from their experience.
Tom Dietterich
SS 2011 | Computer Science Department | UKP Lab - György Szarvas
-
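[Figure: a system that maps input variables to output variables through hidden variables]
Input Variables: $\mathbf{x} = (x_1, x_2, \dots, x_N)$
Hidden Variables: $\mathbf{h} = (h_1, h_2, \dots)$
Output Variables: $\mathbf{y} = (y_1, y_2, \dots, y_K)$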
-
Machine learning is NOT needed when the relationships between all system variables
(input, output, and hidden) are completely understood!
This is NOT the case for almost any real system!
-
Supervised Learning
Unsupervised Learning
-
Given: Training examples $\{(x_1, f(x_1)), (x_2, f(x_2)), \dots, (x_P, f(x_P))\}$ for some unknown function $y = f(x)$
Find $f(x)$
Predict $y' = f(x')$, where $x'$ is not in the training set
-
Definition: A computer program is said to learn
from experience E with respect to some class of tasks T
and performance measure P,
if its performance at tasks in T, as measured by P, improves with experience E.
Learned hypothesis: model of problem/task T
Model quality: accuracy/performance measured by P
-
Data: experience E in the form of examples / instances
characteristic of the whole input space
independent and identically distributed (no bias in selection / observations)
Good example: 1000 abstracts chosen randomly out of 20M PubMed entries (abstracts)
probably i.i.d.
representative? If annotation is involved, it is always a question of compromises
Definitely bad example: all abstracts that have John Smith as an author
Instances have to be comparable to each other
-
Example: a set of queries and, for each query, a set of top retrieved documents
(characterized via tf, idf, tf*idf, PageRank, BM25 scores)
the top retrieved set is dependent on the underlying IR system!
issues with representativeness, but for reranking this is fine
characterization is dependent on the query (except PageRank), i.e. only certain pairs (for the same Q) are meaningfully comparable (cf. independent examples for the same Q)
we have to normalize the features per query to have the same mean/variance
we have to form pairs and compare e.g. the difference of feature values
Toy example: Q = learning, rank 1: tf = 15, rank 100: tf = 2
Q = overfitting, rank 1: tf = 2, rank 10: tf = 0
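The two steps above (per-query normalization, then pairing documents of the same query) can be sketched as follows. This is a minimal illustration on the toy tf values from the slide; the function names and the choice of z-scoring with the population standard deviation are assumptions, not something the tutorial prescribes.

```python
from itertools import combinations
from statistics import mean, pstdev

# Toy data from the slide: (query, rank, tf)
rows = [
    ("learning", 1, 15), ("learning", 100, 2),
    ("overfitting", 1, 2), ("overfitting", 10, 0),
]

def normalize_per_query(rows):
    """Z-score the tf feature separately within each query (assumed scheme)."""
    by_query = {}
    for query, rank, tf in rows:
        by_query.setdefault(query, []).append((rank, tf))
    normalized = {}
    for query, docs in by_query.items():
        values = [tf for _, tf in docs]
        mu, sigma = mean(values), pstdev(values)
        normalized[query] = [
            (rank, (tf - mu) / sigma if sigma > 0 else 0.0) for rank, tf in docs
        ]
    return normalized

def pairwise_differences(normalized):
    """Pair documents of the same query only; the pair feature is the difference."""
    pairs = []
    for query, docs in normalized.items():
        for (rank_a, z_a), (rank_b, z_b) in combinations(sorted(docs), 2):
            pairs.append((query, rank_a, rank_b, z_a - z_b))
    return pairs

normalized = normalize_per_query(rows)
pairs = pairwise_differences(normalized)
```

After normalization the raw values tf = 15 (for "learning") and tf = 2 (for "overfitting") map to the same score, which is the point of the toy example: a rank-1 document looks alike across queries even though the raw counts differ.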
-
The available examples (experience) have to be
described to the algorithm in a consumable format
Here: examples are represented as vectors of pre-defined features
E.g. for credit risk assessment, typical features can be: income range, ...,
city of residence, etc.
Common feature types
binary (criminal record, Y/N)
nominal (city of residence, X)
ordinal (income range, 0-10K, 10-20K, ...)
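As a rough sketch of what "vectors of pre-defined features" can look like for the three feature types above: binary features become 0/1, nominal features are one-hot encoded, and ordinal features keep their order as a rank index. The city names, the third income range, and the `encode` helper are hypothetical values chosen only for illustration.

```python
# Hypothetical value sets, for illustration only
CITIES = ["Darmstadt", "Frankfurt", "Mainz"]     # nominal: no order
INCOME_RANGES = ["0-10K", "10-20K", "20-30K"]    # ordinal: list index = rank

def encode(applicant):
    """Turn one credit applicant (a dict) into a numeric feature vector."""
    # binary feature: 1.0 / 0.0
    vector = [1.0 if applicant["criminal_record"] else 0.0]
    # nominal feature: one indicator per possible value (one-hot)
    vector += [1.0 if applicant["city"] == city else 0.0 for city in CITIES]
    # ordinal feature: position in the ordered list of ranges
    vector.append(float(INCOME_RANGES.index(applicant["income_range"])))
    return vector

example = {"criminal_record": False, "city": "Frankfurt", "income_range": "10-20K"}
features = encode(example)
```

One-hot encoding avoids implying an order between cities, whereas the rank index deliberately preserves the order of income ranges.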
-
Experimental practice
By now you've learned what machine learning is. In the supervised approach you need (carefully selected / prepared) examples that you describe through features;
the algorithm then learns a model of the problem based on the examples, and (usually) improvement is observed in terms of some performance measure.
June 10, 2011
-
2 kinds of parameters
one the user sets for the training procedure in advance: hyperparameter
the degree of the polynomial to match in regression
number/size of hidden layers in a neural network
number of instances per leaf in a decision tree
one that actually gets optimized through the training: parameter
regression coefficients
network weights
size/depth of the decision tree (in Weka; other implementations might allow you to control that)
we usually do not talk about the latter, but refer to hyperparameters as parameters
Hyperparameters: the fewer the algorithm has, the better
Naive Bayes the best? No parameters! Usually algorithms with better discriminative power are not parameter-free
typically are set to optimize performance (on a validation set, or through cross-validation)
manual, grid search, simulated annealing, gradient descent, etc.
common pitfall:
select the hyperparameters via CV, e.g. 10-fold + report the same cross-validation results
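One way to avoid the pitfall is to tune on a validation set and report performance on a separate blind test set that played no role in the selection. The sketch below assumes a toy 1-D k-nearest-neighbour classifier, synthetic data with 10% label noise, and a small grid for the hyperparameter k; all of these are illustrative choices, not part of the tutorial.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (binary labels)."""
    neighbors = sorted(train, key=lambda example: abs(example[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return 1 if 2 * votes > k else 0

def accuracy(train, data, k):
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

rng = random.Random(0)

def sample(n):
    """Synthetic data: label is 1 when x > 0.5, flipped with probability 0.1."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 1)
        y = 1 if x > 0.5 else 0
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data

train, validation, test = sample(200), sample(100), sample(100)

grid = [1, 3, 5, 9, 15]                      # candidate values for k (odd: no ties)
best_k = max(grid, key=lambda k: accuracy(train, validation, k))

# Only this number is reported; the test set was never used for selection.
test_accuracy = accuracy(train, test, best_k)
```

Reporting `accuracy(train, validation, best_k)` instead would repeat the pitfall, because that score was maximized over the grid and is therefore optimistically biased.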
-
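[Figure: cross-validation. The data set $X = \{x_1, \dots, x_k\}$ is split into folds X1, X2, X3, X4, X5; in each iteration one fold is used for Test and the remaining folds for Train.]
The result is an average over all iterations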
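The scheme in the figure can be sketched in a few lines: shuffle once, split into k disjoint folds, test on each fold in turn, and average. The majority-class "learner" and the toy data are placeholders assumed here only to keep the example self-contained.

```python
import random

def k_fold_indices(n_instances, k, seed=0):
    """Random split without replacement: every instance lands in exactly one fold."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(data, k, train_fn, eval_fn):
    """Train on k-1 folds, evaluate on the held-out fold, average over all k runs."""
    scores = []
    for test_idx in k_fold_indices(len(data), k):
        held_out = set(test_idx)
        train = [data[i] for i in range(len(data)) if i not in held_out]
        test = [data[i] for i in test_idx]
        scores.append(eval_fn(train_fn(train), test))
    return sum(scores) / k

# Placeholder learner: always predict the majority class of the training folds.
def train_majority(train):
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def eval_accuracy(model, test):
    return sum(y == model for _, y in test) / len(test)

data = [(i, 1 if i % 4 else 0) for i in range(20)]   # 15 positives, 5 negatives
mean_accuracy = cross_validate(data, 5, train_majority, eval_accuracy)
```

Any real learner can be plugged in through `train_fn` and `eval_fn`; the folding logic stays the same.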
-
n-fold CV: common practice for making (hyper)parameter estimation more robust
round robin training/testing n times, with (n-1)/n of the data to train and 1/n of the data to evaluate the model
typical: random splits, without replacement (each instance is tested exactly once)
the other way: random subsampling cross-validation
No Unbiased Estimator of the Variance of K-Fold Cross-Validation (Bengio and Grandvalet 2004)
bad practice? Problem: training sets largely overlap, test errors are also dependent
(... caution)
5x2 CV is a better option: do 2-fold CV and repeat 5 times, calculate the average: less overlap in training sets
Folding via natural units of processing for the given task: typically, document boundaries; best practice is doing it yourself!
ML package / CSV representation is not aware of e.g. document boundaries!
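Doing the folding yourself along document boundaries could look like the sketch below: whole documents, not individual instances, are assigned to folds, so no document contributes to both training and testing. The document ids and the simple round-robin assignment are assumptions made for illustration.

```python
def document_folds(doc_ids, k):
    """Assign whole documents to folds; returns k lists of instance indices."""
    documents = sorted(set(doc_ids))
    fold_of_doc = {doc: i % k for i, doc in enumerate(documents)}  # round robin
    folds = [[] for _ in range(k)]
    for index, doc in enumerate(doc_ids):
        folds[fold_of_doc[doc]].append(index)
    return folds

# One entry per instance (e.g. per sentence or token), naming its source document.
doc_ids = ["d1", "d1", "d2", "d3", "d3", "d3", "d4", "d5", "d5", "d6"]
folds = document_folds(doc_ids, 3)

# Sanity check: the sets of documents seen in different folds do not overlap.
docs_per_fold = [{doc_ids[i] for i in fold} for fold in folds]
```

Note that folds built this way can differ in size when documents differ in length; balancing them is a separate design decision.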
-
Ideally the valid settings are:
take off-the-shelf algorithms, avoid parameter tuning, and compare
n.b. you probably do the folding yourself, trying to minimize biases!
do parameter tuning (n.b. selecting/tuning your features is also tuning!)
but then normally you have to have a blind set (from the beginning)
e.g. have a look at shared tasks, e.g. CoNLL: a practical way to learn
experimental best practice and to align with the predefined standards (you might even benefit from comparative results, etc.)
You might want to do something different
be aware of these & the consequences
-
1. define the task
instance, target variable/labels, collect and label/annotate data
credit risk assessment: credit request, good/bad credit, ~s ran out in the previous year
2. [...] train / validation (development) / test (evaluation) data
3. pick a learning algorithm (e.g. decision tree), train model
train on the training set
optimize/set model hyperparameters (e.g. number of instances / leaf, use pruning, ...) according to performance on validation data
test model accuracy on the (blind) test set
4. ready to use the model to predict unseen instances with an expected
accuracy similar to that seen on test
-
Relation: segment
Instances: 1500
Attributes: 20
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Correctly Classified Instances 290 96.6667 %
Incorrectly Classified Instances 10 3.3333 %
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
Correctly Classified Instances 281 93.6667 %
Incorrectly Classified Instances 19 6.3333 %
-
Fitting a polynomial regression:
$f(x) = \sum_{n=0}^{M} a_n x^n$
By, for instance, least squares:
$\arg\min_{a} \sum_{j=1}^{l} \left( y_j - \sum_{n=0}^{M} a_n x_j^n \right)^2$
[Figure: four panels showing polynomial fits of degree M = 0, M = 1, M = 3, and M = 9 to the same data; horizontal axis x, vertical axis t]
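To make the least-squares fit concrete, here is a small pure-Python sketch that solves the normal equations for the coefficients a_0..a_M and reports the training error for increasing M. The noise-free sine data and the helper names are assumptions. M = 9 is left out on purpose, because the normal equations become numerically ill-conditioned at that degree.

```python
import math

def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients a_0..a_M from the normal equations (A^T A) a = A^T y."""
    size = degree + 1
    # Augmented matrix [A^T A | A^T y], where A[j][n] = x_j ** n
    m = [[sum(x ** (r + c) for x in xs) for c in range(size)]
         + [sum(y * x ** r for x, y in zip(xs, ys))] for r in range(size)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(size):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[r][size] / m[r][r] for r in range(size)]

def squared_error(coeffs, xs, ys):
    """Sum over j of (y_j - sum_n a_n x_j^n)^2, i.e. the objective being minimized."""
    return sum((y - sum(a * x ** n for n, a in enumerate(coeffs))) ** 2
               for x, y in zip(xs, ys))

xs = [i / 9 for i in range(10)]
ys = [math.sin(2 * math.pi * x) for x in xs]     # assumed target: noise-free sine
errors = {M: squared_error(fit_polynomial(xs, ys, M), xs, ys) for M in (0, 1, 3)}
```

The training error can only go down as M grows, since each lower-degree polynomial is a special case of the higher-degree one; this is why training error alone cannot be used to choose M.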
-
Important concept: discriminative power of the algorithm
linear vs nonlinear model
some theoretical aspects:
a 1-hidden-layer NN with unlimited hidden nodes can approximate any smooth function/surface arbitrarily well
-
Overfitting: the model fits the training set (too) closely, but has no (or bad) generalization ability
results in high test error (useless model)
Underfitting: the model is not capable of learning the (complex) patterns in the training set
Reasons for underfitting and overfitting:
lack of discriminative power
small sample size
noise in the data (labels or features)
generalization ability of the algorithm has to be chosen w.r.t. sample size
Size (complexity) of the learnt model grows with data size
-
TP: p classified as p
FN: p classified as n
FP: n classified as p
TN: n classified as n
Good prediction: TP + TN
Error: FP (false alarm) + FN (miss)
-
Accuracy: the rate of correct predictions made by the model over a data set (cf. coverage). (TP+TN) / (TP+FN+FP+TN)
Error rate: the rate of incorrect predictions made by the model over a data set (cf. coverage). (FP+FN) / (TP+FN+FP+TN)
[Root]?[Mean|Absolute][Squared]?Error: the difference between the predicted and actual values
e.g. $\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum (f(x) - y)^2}$
Algorithms (e.g. those in Weka) typically optimize these
there might be a mismatch between the optimization objective and the actual evaluation measure
optimizing different measures is a research area on its own (e.g. in ML for IR, a.k.a. learning to rank)
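The three measures translate directly into code. The sketch below follows the formulas as given; the confusion-matrix counts in the usage lines are made-up numbers for illustration.

```python
import math

def accuracy(tp, fn, fp, tn):
    """(TP+TN) / (TP+FN+FP+TN)"""
    return (tp + tn) / (tp + fn + fp + tn)

def error_rate(tp, fn, fp, tn):
    """(FP+FN) / (TP+FN+FP+TN)"""
    return (fp + fn) / (tp + fn + fp + tn)

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)

# Illustrative counts: 40 TP, 10 FN, 5 FP, 45 TN
acc = accuracy(40, 10, 5, 45)
err = error_rate(40, 10, 5, 45)
```

Accuracy and error rate always sum to 1, so reporting both is redundant.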
-
Precision: fraction of correctly predicted positives among all predicted positives
TP / (TP+FP)
Recall: fraction of correctly predicted positives among all actual positives
TP / (TP+FN)
F measure: weighted harmonic mean of Precision and Recall (usually equal weighted, $\beta = 1$)
$F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^2 \cdot \mathrm{precision} + \mathrm{recall}}$
Only makes sense for a subset of classes (usually measured for a single class)
For all classes, it equals the accuracy
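A short sketch of the three measures, plus a check of the last remark: when TP, FP and FN are pooled over all classes of a single-label problem, every error counts once as an FP (for the predicted class) and once as an FN (for the true class), so precision, recall and F all equal the accuracy. The label lists are made-up example data.

```python
def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

# Single class: 1 TP, 1 FP, 0 FN
f_single = f_measure(precision(1, 1), recall(1, 0))

# Pooled over all classes (illustrative labels)
gold = ["a", "b", "a", "c"]
pred = ["a", "a", "a", "c"]
tp = sum(g == p for g, p in zip(gold, pred))
fp = fn = len(gold) - tp          # each error is one FP and one FN
micro_f = f_measure(precision(tp, fp), recall(tp, fn))
acc = tp / len(gold)
```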
-
A sequence of tokens with the same label is treated as a single instance
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.
Why? We need complete phrases to be identified correctly
How? With an external evaluation script, e.g. conlleval for NER
Example tagging: John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.
Multiple penalty:
3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
2 FPs: Johns Hopkins (PER) and University (ORG)
1 FN: Johns Hopkins University (ORG)
F(PER) = 0.67, F(ORG) = 0.5
-
1. [...] time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.
2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance [...]. These require humans in the loop.
3. Automatic correlation-driven functions. Typical examples are BLEU, ROUGE, word error rate, mean-average-precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.
4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), f-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.
Evaluation measures can become dysfunctional when you are optimizing them!
phrase P/R/F e.g. in NER
Readability measures
-
John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG.
Example tagging 1: John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG.
3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
2 FPs: Johns Hopkins (PER) and University (ORG)
1 FN: Johns Hopkins University (ORG)
F(PER) = 0.67, F(ORG) = 0.5
Example tagging 2: John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG.
3 Positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)
0 FP
1 FN: Johns Hopkins University (ORG)
F(PER) = 1.0, F(ORG) = 0.67
Optimizing phrase-F can encourage / prefer systems that do not mark entities!
Most likely, this is bad
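The F values for the two example taggings can be reproduced with a small phrase-level scorer. This is a simplified sketch, not the conlleval script: it assumes the token_LABEL format used in the examples (no B-/I- prefixes), treats every maximal run of identically labeled tokens as one phrase, and counts a phrase as correct only if both its span and its label match. The sentence-final period is dropped to keep the label parsing trivial.

```python
def extract_phrases(tagged):
    """Return the set of (start, end, label) phrases in a token_LABEL string."""
    labels = [token.rsplit("_", 1)[1] for token in tagged.split()] + ["O"]  # sentinel
    phrases, start, current = set(), 0, "O"
    for i, label in enumerate(labels):
        if label != current:
            if current != "O":
                phrases.add((start, i, current))
            start, current = i, label
    return phrases

def phrase_f1(gold, predicted, label):
    gold_set = {p for p in extract_phrases(gold) if p[2] == label}
    pred_set = {p for p in extract_phrases(predicted) if p[2] == label}
    tp = len(gold_set & pred_set)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_set), tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)

gold = ("John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG "
        "University_ORG before_O joining_O IBM_ORG")
tagging1 = ("John_PER studied_O at_O the_O Johns_PER Hopkins_PER "
            "University_ORG before_O joining_O IBM_ORG")
tagging2 = ("John_PER studied_O at_O the_O Johns_O Hopkins_O "
            "University_O before_O joining_O IBM_ORG")

scores1 = {label: phrase_f1(gold, tagging1, label) for label in ("PER", "ORG")}
scores2 = {label: phrase_f1(gold, tagging2, label) for label in ("PER", "ORG")}
```

Tagging 2, which simply leaves the difficult entity unmarked, scores higher on both classes than tagging 1, which found the right tokens but split the phrase.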
-
ROC: Receiver Operating Characteristic curve
Curve that depicts the relation between recall (sensitivity) and false positives
[Figure: ROC curve with the best case and worst case marked; vertical axis: Sensitivity (Recall); horizontal axis: False Positives, FP / (FP+TN)]
-
Area under the curve (AUC)
As you vary the decision threshold, you can plot the recall vs. false positive rate
The area under the curve measures how accurately your model separates positives from negatives
perfect ranking: AUC = 1.0
random decision: AUC = 0.5
Similarly (e.g. in IR): area under the P/R curve
when there are too many true negatives
correctly identifying negatives is not interesting anyway
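The area under the ROC curve can be computed without drawing the curve, using the equivalent ranking view: it is the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half. The sketch below uses that pairwise formulation; the scores and labels are made-up examples.

```python
def auc(scores, labels):
    """Fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

perfect = auc([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 0])   # all positives ranked first
mixed = auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])     # one of four pairs is inverted
constant = auc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])  # uninformative scores
```

The pairwise loop is quadratic in the number of instances, which is fine for illustration; sorting-based implementations are preferable for large data sets.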
-
Precision@K
fraction of true positives among the top K predictions / ranks
MAP
The average of precisions computed at the point of each of the positives in the ranked list (P=0 for positives not ranked at all)
NDCG
For graded relevance / ranking
Highly relevant documents appearing lower in a search result list should be penalized as the graded relevance value is reduced logarithmically proportional to the position of the result.
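The ranking measures above can be sketched for a single query as follows. MAP is then the mean of `average_precision` over all queries. The discount `1 / log2(rank + 1)` is one common DCG variant and is an assumption here, as are the function names and the example lists.

```python
import math

def precision_at_k(relevance, k):
    """relevance: 0/1 list in ranked order; share of positives in the top k."""
    return sum(relevance[:k]) / k

def average_precision(relevance, total_positives):
    """Mean of precision at each retrieved positive; unranked positives add 0."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_positives

def dcg(gains):
    """Graded relevance, discounted logarithmically by rank position."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

ranked = [1, 0, 1, 0]                       # positives at ranks 1 and 3
p_at_2 = precision_at_k(ranked, 2)
ap_all_found = average_precision(ranked, total_positives=2)
ap_one_missing = average_precision(ranked, total_positives=3)
```

Passing `total_positives` explicitly is what implements the "P=0 for positives not ranked at all" rule: a positive missing from the list lowers the average without appearing in the sum.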
-
Measures how the accuracy / error of the model changes with
sample size
iteration number
Smaller sample: worse accuracy
more likely bias in the estimate (representative sample)
variance in the estimate
If it looks different:
you are plotting error vs. size/iteration
overfitting (iteration, not sample size)!
-
varying amount of training data (Banko & Brill, 2001):
Winnow
naive Bayes
memory-based learner
Features: bag of words:
words within a window of the [...]
collocations containing specific words and/or part of speech
Training corpus: 1 billion words from a variety of English texts (news articles, literature, scientific abstracts, etc.)
-
Supervised learning: based on a set of labeled examples (x, f(x)), learn the
input-output mapping, i.e. f(x)
3 factors of successful machine learning models
much data
good features
well-suited learning algorithm
ML workflow
1. problem definition
2. [...]
3. selection of learning algorithm, (hyper)parameter tuning, training a final model
4. predict unseen examples & fill tables / draw figures for the paper - test
Be careful with
data representation (i.i.d., comparability, ...)
experimental setup (cross-validation, blind testing, ...)
data size and algorithm selection (overfitting, underfitting, ...)
evaluation measures