Artificial Intelligence
Ian Gent [email protected]
Part I: Philosophy of Science
Part II: Experiments in AI
Part III: Basics of Experimental Design
with AI case studies
Empirical Evaluation of Computer Systems
3
Science as Refutation
The modern view of the progress of science is based on Popper (Sir Karl Popper, that is).
A scientific theory is one that can be refuted, i.e. it should make testable predictions.
If these predictions are incorrect, the theory is false. A false theory may still be useful, e.g. Newtonian physics.
Therefore science is hypothesis testing. Artificial intelligence aspires to be a science.
4
Empirical Science
Empirical = “Relying upon or derived from observation or experiment”
Most (perhaps all) of science is empirical. Consider theoretical computer science:
its study is based on Turing machines, the lambda calculus, etc.
It is founded on the empirical observation that the computer systems developed to date are Turing-complete. Quantum computers might challenge this;
if so, an empirically based theory of quantum computing will develop.
5
Theory, not Theorems
Theory-based science need not be all theorems, otherwise science would be mathematics.
Compare the physics theory QED, perhaps the most accurate theory in the whole of science. It is based on a model of the behaviour of particles, with predictions accurate to many (nine?) decimal places. Its success derives from the accuracy of its predictions,
not the depth, difficulty, or beauty of its theorems. That is, QED is an empirical theory.
AI/CS has too many theorems and not enough theory; compare the advice on how to publish in JACM.
6
Empirical CS/AI
Computer programs are formal objects, so some researchers use only theory that can be proved by theorems. But theorems are hard.
Instead, treat computer programs as natural objects, like quantum particles, chemicals, or living organisms,
and perform empirical experiments on them. We have a huge advantage over other sciences:
no need for supercolliders (expensive) or animal experiments (ethical problems),
so we should have complete command of our experiments.
7
What are our hypotheses?
My search program is better than yours.
Search cost grows exponentially with the number of variables for this kind of problem (testable as sketched below).
Constraint search systems are better at handling overconstrained systems, but OR (operations research) systems are better at handling underconstrained systems.
My company should buy an AI search system rather than an OR one.
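To make the second hypothesis concrete: if search cost grows exponentially with the number of variables n, then log(cost) should be roughly linear in n. Below is a minimal sketch of that test in Python; the solver here is an invented stand-in (exhaustive enumeration of 2^n assignments), so substitute the program actually under study.

    import itertools
    import math
    import time

    def solver(n):
        # Hypothetical stand-in for the search program under test:
        # exhaustively enumerate all 2^n assignments of n boolean variables.
        for _ in itertools.product((0, 1), repeat=n):
            pass

    sizes = list(range(10, 21))
    costs = []
    for n in sizes:
        start = time.perf_counter()
        solver(n)
        costs.append(time.perf_counter() - start)

    # Least-squares slope of log(cost) against n; a good straight-line fit
    # supports exponential growth, and exp(slope) estimates the growth factor.
    ys = [math.log(max(c, 1e-9)) for c in costs]   # guard against timer granularity
    mx = sum(sizes) / len(sizes)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(sizes, ys))
             / sum((x - mx) ** 2 for x in sizes))
    print("estimated growth factor per extra variable: %.2f" % math.exp(slope))

For the stand-in, the estimated growth factor comes out near 2, as expected for a 2^n search; for a real program, a poor straight-line fit of log(cost) against n is evidence against the exponential-growth hypothesis.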
8
Why do experiments?
Too often AI experimenters might talk like this:
What is your experiment for? To see if my algorithm is better than his.
Why? I want to know which is faster.
Why? Lots of people use each kind …
How will these people use your result?
?
9
Why do experiments?
Compare experiments on identical twins:
What is your experiment for? I want to compare identical twins reared apart to those reared together, and non-identical twins too.
Why? We can get estimates of the genetic and social contributors to performance.
Why? Because the role of genetics in behaviour is one of the great unsolved questions.
Experiments should address research questions; otherwise they can just be “track meets”.
10
Basic issues in Experimental Design
From Paul R Cohen, Empirical Methods for Artificial Intelligence, MIT Press, 1995, Chapter 3
Control
Ceiling and Floor effects
Sampling Biases
11
Control
A control is an experiment in which the hypothesised variation does not occur, so the hypothesised effect should not occur either.
E.g. macaque monkeys were given a vaccine based on human T-cells infected with SIV (a relative of HIV), and the macaques gained immunity to SIV.
Later, macaques were given uninfected human T-cells, and they still gained immunity!
The control experiment was not originally done, and the right controls are not always obvious (you can’t control for all variables). A sketch of a control in an AI setting follows.
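Here is a minimal sketch of the same idea in an AI experiment; the toy problem and all names are invented. Suppose we hypothesise that a value-ordering heuristic reduces search cost. The control runs identical machinery with the heuristic replaced by a random ordering, so the hypothesised variation is absent and any remaining "speed-up" must have another cause.

    import random

    def search_cost(solution, ordering):
        # Toy "solver": count how many candidate values are tried before
        # this toy problem's solution is found.
        for cost, value in enumerate(ordering, start=1):
            if value == solution:
                return cost
        return len(ordering)

    def heuristic_order(values):
        return sorted(values)  # heuristic under test: try small values first

    def control_order(values):
        # Control: same machinery, heuristic replaced by a random ordering.
        return random.sample(values, len(values))

    random.seed(1)
    values = list(range(100))
    problems = [random.randint(0, 20) for _ in range(500)]  # solutions skew small

    for name, order in [("heuristic", heuristic_order), ("control", control_order)]:
        mean = sum(search_cost(p, order(values)) for p in problems) / len(problems)
        print("%s: mean search cost %.1f" % (name, mean))

If the control ordering did nearly as well as the heuristic, the credit would belong to the problem distribution rather than the heuristic, just as the “vaccine” turned out not to be the cause of the macaques’ immunity.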
12
Case Study: MYCIN
MYCIN was a medical expert system that recommended therapy for blood and meningitis infections.
How to evaluate its recommendations? Shortliffe used:
10 sample problems
8 other therapy recommenders: 5 faculty at Stanford Med. School, 1 senior resident, 1 senior postdoctoral researcher, 1 senior student
8 impartial judges, each giving 1 point per problem, so the maximum score was 80 (10 problems × 8 judges)
Scores: MYCIN 65; faculty 40-60; fellow 60; resident 45; student 30
13
Case Study: MYCIN
What were the controls?
Control for judges’ bias for or against computers: judges did not know who recommended each therapy.
Control for easy problems: the medical student did badly, so the problems were not easy.
Control for our standard being too low: e.g. random choice should do worse.
Control for the factor of interest: e.g. for the hypothesis in MYCIN that “knowledge is power”, have groups with different levels of knowledge.
A sketch of the blinding control follows.
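A minimal sketch of the first control, blinding, with invented data structures (the actual study materials are not shown in these slides): judges see each therapy recommendation under an anonymous label, and the mapping back to recommenders is revealed only after scoring.

    import random

    def blind(recommendations):
        # recommendations: dict mapping recommender name -> therapy (hypothetical).
        items = list(recommendations.items())
        random.shuffle(items)  # hide any ordering that hints at the source
        labels = [chr(ord("A") + i) for i in range(len(items))]
        blinded = {lab: therapy for lab, (_, therapy) in zip(labels, items)}
        key = {lab: who for lab, (who, _) in zip(labels, items)}
        return blinded, key  # judges see only blinded; key is revealed after scoring

    random.seed(0)
    recs = {"MYCIN": "therapy-1", "Faculty": "therapy-2", "Resident": "therapy-3"}
    blinded, key = blind(recs)
    print(blinded)  # e.g. {'A': 'therapy-2', 'B': 'therapy-3', 'C': 'therapy-1'}
    print(key)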
14
Ceiling and Floor Effects
Well designed experiments can go wrong. What if all our algorithms do particularly well (or they all do badly)? We’ve got little evidence to choose between them.
Ceiling effects arise when test problems are insufficiently challenging; floor effects are the opposite, when problems are too challenging.
This is a problem in AI because we often use benchmark sets. But how do we detect the effect?
15
Ceiling Effects: Machine Learning
14 datasets from the UCI corpus of benchmarks, a mainstay of the ML community.
The problem is learning classification rules: each item is a vector of features plus a classification, and we measure the classification accuracy of each method (max 100%).
Compare C4 with 1R*, two competing algorithms:

DataSet:  BC    CH    GL    G2    HD    HE    …   Mean
C4:       72    99.2  63.2  74.3  73.6  81.2  ... 85.9
1R*:      72.5  69.2  56.4  77    78    85.1  ... 83.8
16
Ceiling Effects
DataSet:  BC    CH    GL    G2    HD    HE    …   Mean
C4:       72    99.2  63.2  74.3  73.6  81.2  ... 85.9
1R*:      72.5  69.2  56.4  77    78    85.1  ... 83.8
Max:      72.5  99.2  63.2  77    78    85.1  …   87.4

C4 achieves only about 2% better mean accuracy than 1R*, and even taking the better of C4 and 1R* on each dataset achieves only 87.4%.
So we have only weak evidence that C4 is better: both methods perform near the ceiling of what is practicable, and that ceiling effect means we cannot compare the two methods well. A sketch of this check appears below.
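A minimal sketch of that ceiling check, using only the six datasets visible in the table (the slide’s means are over all 14 datasets, so the numbers printed here differ): estimate the ceiling on each dataset as the best accuracy any method achieves, then see how far each method sits below it.

    # Accuracy columns: BC, CH, GL, G2, HD, HE (the six datasets shown above).
    accuracy = {
        "C4":  [72.0, 99.2, 63.2, 74.3, 73.6, 81.2],
        "1R*": [72.5, 69.2, 56.4, 77.0, 78.0, 85.1],
    }

    # Per-dataset ceiling estimate: the best accuracy any method achieves.
    best = [max(col) for col in zip(*accuracy.values())]

    for name, scores in accuracy.items():
        mean = sum(scores) / len(scores)
        gap = sum(b - s for b, s in zip(best, scores)) / len(scores)
        print("%-4s mean %.1f, mean gap below best %.1f" % (name, mean, gap))

    # Small gaps for both methods mean both sit near the estimated ceiling,
    # so the comparison between them carries little evidence.
    print("best-of-both mean %.1f" % (sum(best) / len(best)))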
17
Ceiling Effects
In fact 1R* uses only one feature (the best one), while C4 uses 6.6 features on average; the extra 5.6 features buy only about 2% improvement. Conclusion?
Either real-world learning problems are easy (use 1R*), or we need more challenging datasets.
Either way, we need to be aware of ceiling effects in results. A sketch of the 1R idea follows.
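For reference, a minimal sketch of the 1R idea (simplified; Holte’s actual 1R/1R* also handles numeric features and missing values): pick the single feature whose value-to-majority-class rule scores best on the training data.

    from collections import Counter, defaultdict

    def one_r(rows, labels):
        # rows: list of feature tuples; labels: parallel list of class labels.
        best = None
        for f in range(len(rows[0])):
            by_value = defaultdict(Counter)
            for row, label in zip(rows, labels):
                by_value[row[f]][label] += 1
            # Rule for feature f: map each value to its majority class.
            correct = sum(c.most_common(1)[0][1] for c in by_value.values())
            if best is None or correct > best[0]:
                rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
                best = (correct, f, rule)
        return best  # (training hits, chosen feature index, value -> class rule)

    rows = [("sunny", "hot"), ("rainy", "hot"), ("sunny", "cold"), ("rainy", "cold")]
    labels = ["out", "in", "out", "in"]
    print(one_r(rows, labels))  # picks feature 0: weather perfectly predicts the class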
18
Sampling Bias
Sampling bias is when data collection is biased against certain data, e.g. a teacher who says “Girls don’t answer maths questions”. Observation might suggest that indeed girls don’t answer many questions, but also that the teacher doesn’t ask them many questions.
Experienced AI researchers don’t do that, right?
19
Case Study: Phoenix
Phoenix = an AI system to fight (simulated) forest fires. Experiments suggested that wind speed was uncorrelated with the time to put out the fire, which is obviously incorrect (high winds spread forest fires).
Wind speed vs containment time (max 150 hours):
3: 120 55 79 10 140 26 15 110 12 54 10 103
6: 78 61 58 81 71 57 21 32 70
9: 62 48 21 55 101
What’s the problem?
20
Sampling bias in Phoenix
The cut-off at 150 hours introduces sampling bias: many high-wind fires get cut off, not many low-wind ones. On the remaining data there is essentially no correlation between wind speed and time (r = -0.053).
In fact, the data show that a lot of high-wind fires take more than 150 hours to contain, and those that don’t are similar to low-wind fires.
You wouldn’t do this, right? You might, if you had automated data analysis; the sketch below shows how easily it happens.
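A minimal simulation of the effect, under an invented model rather than Phoenix’s actual fire simulator: containment time is exponentially distributed with mean proportional to wind speed, so wind genuinely matters, and then every fire over 150 hours is dropped, as in the experiment.

    import random

    def pearson_r(xs, ys):
        # Pearson correlation coefficient of two equal-length sequences.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return sxy / (sxx * syy) ** 0.5

    random.seed(0)
    # True effect built in: mean containment time is 50 hours per unit of wind speed.
    fires = [(w, random.expovariate(1.0 / (50 * w)))
             for _ in range(2000) for w in (3, 6, 9)]

    kept = [(w, t) for w, t in fires if t <= 150]  # the 150-hour cut-off

    print("r on all fires:      %+.2f" % pearson_r([w for w, _ in fires], [t for _, t in fires]))
    print("r on surviving data: %+.2f" % pearson_r([w for w, _ in kept], [t for _, t in kept]))
    for w in (3, 6, 9):
        frac = (sum(1 for v, t in kept if v == w)
                / sum(1 for v, _ in fires if v == w))
        print("wind %d: %.0f%% of fires survive the cut-off" % (w, 100 * frac))

On the full data the correlation is strongly positive; after the cut-off it drops close to zero, and far more high-wind than low-wind fires have been discarded, reproducing the pattern in the Phoenix data.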