Artificial Intelligence, Lecture 7.3
Transcript of lecture slides. © D. Poole and A. Mackworth 2010.
Learning Objectives
At the end of the class you should be able to:
show an example of decision-tree learning
explain how to avoid overfitting in decision-tree learning
explain the relationship between linear and logistic regression
explain how overfitting can be avoided
Basic Models for Supervised Learning
Many learning algorithms can be seen as deriving from:
decision trees
linear (and non-linear) classifiers
Bayesian classifiers
Learning Decision Trees
Representation is a decision tree.
Bias is towards simple decision trees.
Search through the space of decision trees, from simple decision trees to more complex ones.
Decision trees
A (binary) decision tree (for a particular output feature) is a tree where:
Each nonleaf node is labeled with a test (a function of the input features).
The arcs out of a node are labeled with the possible values of the test.
The leaves of the tree are labeled with a point prediction of the output feature.
Example Decision Trees
The first tree makes deterministic point predictions:

Length
├── long: skips
└── short: Thread
    ├── new: reads
    └── follow_up: Author
        ├── known: reads
        └── unknown: skips

The second tree predicts a probability at one of its leaves:

Length
├── long: skips
└── short: reads with probability 0.82
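To make the representation concrete, here is a minimal sketch of the first tree in Python (the nested-dict encoding and the classify helper are illustrative assumptions, not from the slides):

```python
# The first example tree as nested dicts: internal nodes carry a test
# (an input feature), arcs carry its values, and leaves are point predictions.
tree = {"test": "Length",
        "branches": {"long": "skips",
                     "short": {"test": "Thread",
                               "branches": {"new": "reads",
                                            "follow_up": {"test": "Author",
                                                          "branches": {"known": "reads",
                                                                       "unknown": "skips"}}}}}}

def classify(node, example):
    """Follow the arc matching the example's value for each test until a leaf."""
    while isinstance(node, dict):
        node = node["branches"][example[node["test"]]]
    return node

print(classify(tree, {"Length": "short", "Thread": "follow_up", "Author": "known"}))
# -> reads
```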
Equivalent Logic Program
skips ← long.
reads ← short ∧ new.
reads ← short ∧ follow_up ∧ known.
skips ← short ∧ follow_up ∧ unknown.

or with negation as failure:

reads ← short ∧ new.
reads ← short ∧ ∼new ∧ known.
Issues in decision-tree learning
Given some training examples, which decision tree should be generated?
A decision tree can represent any discrete function of the input features.
You need a bias. For example, prefer the smallest tree. Least depth? Fewest nodes? Which trees are the best predictors of unseen data?
How should you go about building a decision tree? The space of decision trees is too big for a systematic search for the smallest decision tree.
Searching for a Good Decision Tree
The input is a set of input features, a target feature, and a set of training examples.
Either:
- Stop and return a value for the target feature (or a distribution over target feature values), or
- Choose a test (e.g., an input feature) to split on. For each value of the test, build a subtree for those examples with that value for the test.
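A minimal recursive sketch of this procedure (the majority-value stopping choice and the choose_split heuristic are assumptions filled in for illustration):

```python
from collections import Counter

def learn_tree(examples, features, target):
    """Stop and return a point prediction, or choose a test to split on
    and build a subtree for each of its values."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]  # majority value
    test = choose_split(examples, features, target)
    branches = {}
    for value in {e[test] for e in examples}:
        subset = [e for e in examples if e[test] == value]
        branches[value] = learn_tree(subset, [f for f in features if f != test], target)
    return {"test": test, "branches": branches}

def choose_split(examples, features, target):
    """Myopic choice: the single split with the smallest classification error."""
    def error(feature):
        total = 0
        for value in {e[feature] for e in examples}:
            subset = [e[target] for e in examples if e[feature] == value]
            total += len(subset) - Counter(subset).most_common(1)[0][1]
        return total
    return min(features, key=error)
```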
Choices in implementing the algorithm
When to stop:
- no more input features
- all examples are classified the same
- too few examples to make an informative split

Which test to split on isn't defined. Often we use a myopic split: choose the single split that gives the smallest error.

With multi-valued features, the test can split on all values or split the values into two sets. More complex tests are possible.
Example Classification Data
Training examples:

Ex  Action  Author   Thread  Length  Where
e1  skips   known    new     long    home
e2  reads   unknown  new     short   work
e3  skips   unknown  old     long    work
e4  skips   known    old     long    home
e5  reads   known    new     short   home
e6  skips   known    old     long    work

New examples:

e7  ???     known    new     short   work
e8  ???     unknown  new     short   work

We want to classify the new examples on feature Action based on the examples' Author, Thread, Length, and Where.
Example: possible splits
Starting from skips 9, reads 9, two candidate splits:

Split on length:
├── long: skips 7, reads 0
└── short: skips 2, reads 9

Split on thread:
├── new: skips 3, reads 7
└── old: skips 6, reads 2
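The myopic comparison can be read straight off these counts; a small check (the split_error helper is an assumed illustration), where each leaf predicts its majority class:

```python
def split_error(branches):
    """Misclassifications if each branch predicts its majority class,
    given (skips, reads) counts per branch."""
    return sum(min(skips, reads) for skips, reads in branches)

print(split_error([(9, 9)]))          # no split        -> 9 errors
print(split_error([(7, 0), (2, 9)]))  # split on length -> 2 errors
print(split_error([(3, 7), (6, 2)]))  # split on thread -> 5 errors
```

By this measure, splitting on length is the better myopic choice.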
Handling Overfitting
This algorithm can overfit the data. This occurs when the tree fits noise and correlations in the training set that are not reflected in the data as a whole.

To handle overfitting:
- restrict the splitting, and split only when the split is useful.
- allow unrestricted splitting and prune the resulting tree where it makes unwarranted distinctions.
- learn multiple trees and average them.
Linear Function
A linear function of features X1, …, Xn is a function of the form:

f^w(X1, …, Xn) = w0 + w1 X1 + ⋯ + wn Xn

We invent a new feature X0, which has value 1, to make w0 not a special case:

f^w(X1, …, Xn) = ∑_{i=0}^{n} wi Xi
Linear Regression
Aim: predict feature Y from features X1, …, Xn.

A feature is a function of an example. Xi(e) is the value of feature Xi on example e.

Linear regression: predict a linear function of the input features:

Y^w(e) = w0 + w1 X1(e) + ⋯ + wn Xn(e) = ∑_{i=0}^{n} wi Xi(e)

Y^w(e) is the predicted value for Y on example e. It depends on the weights w.
Sum of squares error for linear regression
The sum of squares error on examples E for output Y is:

Error_E(w) = ∑_{e∈E} (Y(e) − Y^w(e))² = ∑_{e∈E} (Y(e) − ∑_{i=0}^{n} wi Xi(e))²

Goal: find weights that minimize Error_E(w).
Finding weights that minimize Error_E(w)

Find the minimum analytically. Effective when it can be done (e.g., for linear regression).

Find the minimum iteratively. Works for larger classes of problems. Gradient descent:

wi ← wi − η ∂Error_E(w)/∂wi

η is the gradient descent step size, the learning rate.
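For the sum-of-squares error above, ∂Error_E(w)/∂wi = −2 ∑_{e∈E} (Y(e) − Y^w(e)) Xi(e), which gives the following batch gradient descent sketch (the toy data, step size, and step count are assumptions):

```python
def gradient_descent(xs, ys, eta=0.01, steps=1000):
    """Minimize the sum-of-squares error of a linear function by gradient
    descent. Each x in xs has the invented feature X0 = 1 prepended."""
    n = len(xs[0])
    w = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for x, y in zip(xs, ys):
            delta = y - sum(wi * xi for wi, xi in zip(w, x))  # Y(e) - Y^w(e)
            for i in range(n):
                grad[i] -= 2 * delta * x[i]
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w

# Recover y = 1 + 2x from four noiseless points.
print(gradient_descent([[1, 0], [1, 1], [1, 2], [1, 3]], [1, 3, 5, 7]))
# -> approximately [1.0, 2.0]
```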
Linear Classifier
Assume we are doing binary classification, with classes {0, 1} (e.g., using indicator functions).

There is no point in making a prediction of less than 0 or greater than 1.

A squashed linear function is of the form:

f^w(X1, …, Xn) = f(w0 + w1 X1 + ⋯ + wn Xn)

where f is an activation function.

A simple activation function is the step function:

f(x) = 1 if x ≥ 0,  f(x) = 0 if x < 0
Error for Squashed Linear Function
The sum of squares error is:
Error_E(w) = ∑_{e∈E} (Y(e) − f(∑_i wi Xi(e)))²
If f is differentiable, we can do gradient descent.
The sigmoid or logistic activation function
[Plot of f(x) = 1/(1 + e^{−x}) for x from −10 to 10, rising from 0 toward 1 with f(0) = 0.5.]

f(x) = 1/(1 + e^{−x})

f′(x) = f(x)(1 − f(x))

A logistic function is the sigmoid of a linear function. Logistic regression: find weights to minimise the error of a logistic function.
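A quick numerical check of the derivative identity (a side illustration; the test points are arbitrary):

```python
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 3.0):
    h = 1e-6
    numeric = (f(x + h) - f(x - h)) / (2 * h)   # finite-difference derivative
    print(x, round(numeric, 6), round(f(x) * (1 - f(x)), 6))  # values agree
```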
Gradient Descent for Logistic Regression
procedure LogisticRegression(X, Y, E, η)
    • X: set of input features, X = {X1, …, Xn}
    • Y: output feature
    • E: set of examples
    • η: learning rate
    initialize w0, …, wn randomly
    repeat
        for each example e in E do
            p ← f(∑_i wi Xi(e))
            δ ← Y(e) − p
            for each i ∈ [0, n] do
                wi ← wi + η δ p (1 − p) Xi(e)
    until some stopping criterion is true
    return w0, …, wn
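A direct Python transcription of the pseudocode (a sketch: the initialization range and the fixed-epoch stopping criterion are assumptions):

```python
import math
import random

def logistic_regression(examples, eta, epochs=1000):
    """examples: list of (x, y) pairs where x includes X0 = 1 and y is 0 or 1."""
    n = len(examples[0][0])
    w = [random.uniform(-0.1, 0.1) for _ in range(n)]  # initialize randomly
    for _ in range(epochs):  # assumed stopping criterion
        for x, y in examples:
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            delta = y - p
            for i in range(n):
                w[i] += eta * delta * p * (1 - p) * x[i]
    return w

# Learn "or" of two inputs (X0 = 1 prepended to each example).
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
print(logistic_regression(data, eta=0.5))
```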
Simple Example
A logistic classifier for reads with bias w0 = 0.4 and weights −0.7 (new), −0.9 (short), and 1.2 (home):

Ex  new  short  home  Predicted reads  Observed reads  δ      error
e1  0    0      0     f(0.4) = 0.6     0               −0.6   0.36
e2  1    1      0     f(−1.2) = 0.23   0               −0.23  0.053
e3  1    0      1     f(0.9) = 0.71    1               0.29   0.084
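The table's entries can be reproduced directly (values match up to the slide's rounding):

```python
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

w0, w_new, w_short, w_home = 0.4, -0.7, -0.9, 1.2
for name, new, short, home, obs in [("e1", 0, 0, 0, 0),
                                    ("e2", 1, 1, 0, 0),
                                    ("e3", 1, 0, 1, 1)]:
    p = f(w0 + w_new * new + w_short * short + w_home * home)  # predicted reads
    delta = obs - p
    print(name, round(p, 2), obs, round(delta, 2), round(delta ** 2, 3))
```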
Linearly Separable
A classification is linearly separable if there is a hyperplane where the classification is true on one side of the hyperplane and false on the other side.

For the sigmoid function, the hyperplane is where:

w0 + w1 X1 + ⋯ + wn Xn = 0

This separates the predictions > 0.5 from those < 0.5.

Linearly separable implies the error can be made arbitrarily small.

[Plots of "or", "and", and "xor" on {0, 1}²: "or" and "and" are linearly separable; "xor" is not.]

Kernel trick: use functions of the input features (e.g., their product) as additional inputs.
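A small sketch of the kernel-trick idea (the product feature and hand-picked weights are assumptions): "xor" is not linearly separable on (X1, X2), but adding the product X1·X2 as an extra input makes it separable:

```python
def step(x):
    return 1 if x >= 0 else 0

w = [-0.5, 1.0, 1.0, -2.0]  # weights for the features (1, X1, X2, X1*X2)
for x1 in (0, 1):
    for x2 in (0, 1):
        features = [1, x1, x2, x1 * x2]
        pred = step(sum(wi * fi for wi, fi in zip(w, features)))
        print(x1, x2, pred, x1 ^ x2)  # prediction matches xor on all 4 points
```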
Variants in Linear Separators
The choice of linear separator gives rise to various algorithms:
Perceptron
Logistic Regression
Support Vector Machines (SVMs)
. . .
Bias in linear classifiers and decision trees
It's easy for a logistic function to represent "at least two of X1, …, Xk are true", using the weights:

w0 = −15,  w1 = ⋯ = wk = 10

This concept forms a large decision tree.

Consider representing a conditional, "If X7 then X2 else X3":
- Simple in a decision tree.
- Complicated (possible?) for a linear separator.
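Checking the "at least two" weights (k = 3 is an assumed instance):

```python
import math
from itertools import product

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

k = 3
for xs in product((0, 1), repeat=k):
    p = f(-15 + 10 * sum(xs))  # w0 = -15, each wi = 10
    print(xs, round(p, 3), sum(xs) >= 2)  # near 1 exactly when >= 2 are true
```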
Bayesian classifiers
Idea: if you knew the classification, you could predict the values of the features.

P(Class | X1, …, Xn) ∝ P(X1, …, Xn | Class) P(Class)

Naive Bayesian classifier: the Xi are independent of each other given the class. Requires: P(Class) and P(Xi | Class) for each Xi.

P(Class | X1, …, Xn) ∝ P(Class) ∏_i P(Xi | Class)

[Belief network: User Action is the class, with children Author, Thread, Length, Where Read.]
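A minimal prediction sketch (the probability tables are invented for illustration):

```python
def naive_bayes(prior, likelihood, observed):
    """P(Class | observed) ∝ P(Class) * product over i of P(Xi | Class)."""
    scores = {}
    for c, p in prior.items():
        for feature, value in observed.items():
            p *= likelihood[c][feature][value]
        scores[c] = p
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

prior = {"reads": 0.5, "skips": 0.5}  # illustrative numbers only
likelihood = {"reads": {"Length": {"long": 0.1, "short": 0.9}},
              "skips": {"Length": {"long": 0.8, "short": 0.2}}}
print(naive_bayes(prior, likelihood, {"Length": "short"}))
# -> {'reads': ~0.82, 'skips': ~0.18}
```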
Learning Probabilities
From a table of counted tuples to a naive Bayes network:

X1  X2  X3  X4  C  Count
…   …   …   …   …  …
t   f   t   t   1  40
t   f   t   t   2  10
t   f   t   t   3  50
…   …   …   …   …  …

[Belief network: C is the class, with children X1, X2, X3, X4.]

P(C = vi) = (∑_{t ⊨ C=vi} Count(t)) / (∑_t Count(t))

P(Xk = vj | C = vi) = (∑_{t ⊨ C=vi ∧ Xk=vj} Count(t)) / (∑_{t ⊨ C=vi} Count(t))

…perhaps including pseudo-counts.
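These ratios, with pseudo-counts, in code (the tuple format and pseudo-count of 1 are assumptions; only the three visible rows are used):

```python
rows = [  # (X1, X2, X3, X4, C, Count): the visible rows of the table
    ("t", "f", "t", "t", 1, 40),
    ("t", "f", "t", "t", 2, 10),
    ("t", "f", "t", "t", 3, 50),
]

def p_class(vi, c0=1.0, num_class_values=3):
    """P(C = vi) with pseudo-count c0 per class value."""
    num = sum(count for *_, c, count in rows if c == vi) + c0
    den = sum(count for *_, count in rows) + c0 * num_class_values
    return num / den

def p_feature_given_class(k, vj, vi, c0=1.0, num_feature_values=2):
    """P(Xk = vj | C = vi), with features indexed k = 0..3."""
    num = sum(r[5] for r in rows if r[k] == vj and r[4] == vi) + c0
    den = sum(r[5] for r in rows if r[4] == vi) + c0 * num_feature_values
    return num / den

print(p_class(1))                        # (40 + 1) / (100 + 3)
print(p_feature_given_class(1, "f", 1))  # (40 + 1) / (40 + 2)
```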
Help System
[Belief network: H is the class; the observed children are the words "able", "absent", "add", …, "zoom".]

The domain of H is the set of all help pages. The observations are the words in the query.

What probabilities are needed? What pseudo-counts and counts are used? What data can be used to learn from?
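One way the inference might look in code (a hedged sketch: the pages, numbers, and the default probability are all invented; each word's presence is treated as a boolean child of H):

```python
def help_posterior(prior, word_given_page, query_words, vocabulary):
    """P(H = page | query) ∝ P(H = page) * product over vocabulary words of
    P(word present | page) or P(word absent | page)."""
    scores = {}
    for page, p in prior.items():
        for w in vocabulary:
            p_w = word_given_page[page].get(w, 0.01)  # assumed default
            p *= p_w if w in query_words else (1 - p_w)
        scores[page] = p
    total = sum(scores.values())
    return {page: s / total for page, s in scores.items()}

prior = {"zoom_help": 0.5, "add_help": 0.5}  # invented pages and numbers
word_given_page = {"zoom_help": {"zoom": 0.8, "add": 0.05},
                   "add_help": {"zoom": 0.05, "add": 0.7}}
print(help_posterior(prior, word_given_page, {"zoom"}, {"zoom", "add"}))
```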