Artificial Intelligence
Ian Gent [email protected]
Part I: Philosophy of Science
Part II: Experiments in AI
Part III: Basics of Experimental Design
with AI case studies
Empirical Evaluation of Computer Systems
3
Science as Refutation
The modern view of the progress of science is based on Popper (Sir Karl Popper, that is).
A scientific theory is one that can be refuted, i.e. it should make testable predictions.
If these predictions are incorrect, the theory is false. A false theory may still be useful, e.g. Newtonian physics.
Therefore science is hypothesis testing. Artificial intelligence aspires to be a science.
4
Empirical Science
Empirical = “Relying upon or derived from observation or experiment”
Most (perhaps all) of science is empirical. Consider theoretical computer science:
its study is based on Turing machines, the lambda calculus, etc.
It is founded on the empirical observation that the computer systems developed to date are Turing-complete. Quantum computers might challenge this;
if so, an empirically based theory of quantum computing will develop.
5
Theory, not Theorems
Theory-based science need not be all theorems, otherwise science would be mathematics.
Compare the physics theory QED, perhaps the most accurate theory in the whole of science. It is based on a model of the behaviour of particles, with predictions accurate to many (nine?) decimal places. Its success derives from the accuracy of its predictions,
not the depth, difficulty, or beauty of its theorems. That is, QED is an empirical theory.
AI/CS has too many theorems and not enough theory; compare the advice on how to publish in JACM.
6
Empirical CS/AI
Computer programs are formal objects, so some researchers use only theory that can be proved by theorems. But theorems are hard.
Instead, treat computer programs as natural objects, like quantum particles, chemicals, or living organisms,
and perform empirical experiments on them. We have a huge advantage over other sciences:
no need for supercolliders (expensive) or animal experiments (ethical problems),
so we should have complete command of our experiments.
7
What are our hypotheses?
My search program is better than yours.
Search cost grows exponentially with the number of variables for this kind of problem (testable as sketched below).
Constraint search systems are better at handling overconstrained systems, but OR (operations research) systems are better at handling underconstrained systems.
My company should buy an AI search system rather than an OR one.
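To make the second hypothesis concrete: if search cost grows exponentially with the number of variables n, then log(cost) should be roughly linear in n. Below is a minimal sketch of that test in Python; the solver here is an invented stand-in (exhaustive enumeration of 2^n assignments), so substitute the program actually under study.

    import itertools
    import math
    import time

    def solver(n):
        # Hypothetical stand-in for the search program under test:
        # exhaustively enumerate all 2^n assignments of n boolean variables.
        for _ in itertools.product((0, 1), repeat=n):
            pass

    sizes = list(range(10, 21))
    costs = []
    for n in sizes:
        start = time.perf_counter()
        solver(n)
        costs.append(time.perf_counter() - start)

    # Least-squares slope of log(cost) against n; a good straight-line fit
    # supports exponential growth, and exp(slope) estimates the growth factor.
    ys = [math.log(max(c, 1e-9)) for c in costs]   # guard against timer granularity
    mx = sum(sizes) / len(sizes)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(sizes, ys))
             / sum((x - mx) ** 2 for x in sizes))
    print("estimated growth factor per extra variable: %.2f" % math.exp(slope))

For the stand-in, the estimated growth factor comes out near 2, as expected for a 2^n search; for a real program, a poor straight-line fit of log(cost) against n is evidence against the exponential-growth hypothesis.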
8
Why do experiments?
Too often AI experimenters might talk like this:
What is your experiment for? To see if my algorithm is better than his.
Why? I want to know which is faster.
Why? Lots of people use each kind …
How will these people use your result?
?
9
Why do experiments?
Compare experiments on identical twins:
What is your experiment for? I want to compare identical twins reared apart to those reared together, and non-identical twins too.
Why? We can get estimates of the genetic and social contributors to performance.
Why? Because the role of genetics in behaviour is one of the great unsolved questions.
Experiments should address research questions; otherwise they can just be “track meets”.
10
Basic issues in Experimental Design
From Paul R Cohen, Empirical Methods for Artificial Intelligence, MIT Press, 1995, Chapter 3
Control
Ceiling and Floor effects
Sampling Biases
11
Control
A control is an experiment in which the hypothesised variation does not occur, so the hypothesised effect should not occur either.
E.g. macaque monkeys were given a vaccine based on human T-cells infected with SIV (a relative of HIV), and the macaques gained immunity to SIV.
Later, macaques were given uninfected human T-cells, and they still gained immunity!
The control experiment was not originally done, and the right controls are not always obvious (you can’t control for all variables). A sketch of a control in an AI setting follows.
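Here is a minimal sketch of the same idea in an AI experiment; the toy problem and all names are invented. Suppose we hypothesise that a value-ordering heuristic reduces search cost. The control runs identical machinery with the heuristic replaced by a random ordering, so the hypothesised variation is absent and any remaining "speed-up" must have another cause.

    import random

    def search_cost(solution, ordering):
        # Toy "solver": count how many candidate values are tried before
        # this toy problem's solution is found.
        for cost, value in enumerate(ordering, start=1):
            if value == solution:
                return cost
        return len(ordering)

    def heuristic_order(values):
        return sorted(values)  # heuristic under test: try small values first

    def control_order(values):
        # Control: same machinery, heuristic replaced by a random ordering.
        return random.sample(values, len(values))

    random.seed(1)
    values = list(range(100))
    problems = [random.randint(0, 20) for _ in range(500)]  # solutions skew small

    for name, order in [("heuristic", heuristic_order), ("control", control_order)]:
        mean = sum(search_cost(p, order(values)) for p in problems) / len(problems)
        print("%s: mean search cost %.1f" % (name, mean))

If the control ordering did nearly as well as the heuristic, the credit would belong to the problem distribution rather than the heuristic, just as the “vaccine” turned out not to be the cause of the macaques’ immunity.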
12
Case Study: MYCIN
MYCIN was a medical expert system that recommended therapy for blood and meningitis infections.
How to evaluate its recommendations? Shortliffe used:
10 sample problems
8 other therapy recommenders: 5 faculty at Stanford Med. School, 1 senior resident, 1 senior postdoctoral researcher, 1 senior student
8 impartial judges, each giving 1 point per problem, so the maximum score was 80 (10 problems × 8 judges)
Scores: MYCIN 65; faculty 40-60; fellow 60; resident 45; student 30
13
Case Study: MYCIN
What were the controls?
Control for judges’ bias for or against computers: judges did not know who recommended each therapy.
Control for easy problems: the medical student did badly, so the problems were not easy.
Control for our standard being too low: e.g. random choice should do worse.
Control for the factor of interest: e.g. for the hypothesis in MYCIN that “knowledge is power”, have groups with different levels of knowledge.
A sketch of the blinding control follows.
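A minimal sketch of the first control, blinding, with invented data structures (the actual study materials are not shown in these slides): judges see each therapy recommendation under an anonymous label, and the mapping back to recommenders is revealed only after scoring.

    import random

    def blind(recommendations):
        # recommendations: dict mapping recommender name -> therapy (hypothetical).
        items = list(recommendations.items())
        random.shuffle(items)  # hide any ordering that hints at the source
        labels = [chr(ord("A") + i) for i in range(len(items))]
        blinded = {lab: therapy for lab, (_, therapy) in zip(labels, items)}
        key = {lab: who for lab, (who, _) in zip(labels, items)}
        return blinded, key  # judges see only blinded; key is revealed after scoring

    random.seed(0)
    recs = {"MYCIN": "therapy-1", "Faculty": "therapy-2", "Resident": "therapy-3"}
    blinded, key = blind(recs)
    print(blinded)  # e.g. {'A': 'therapy-2', 'B': 'therapy-3', 'C': 'therapy-1'}
    print(key)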
14
Ceiling and Floor Effects
Well designed experiments can go wrong. What if all our algorithms do particularly well (or they all do badly)? We’ve got little evidence to choose between them.
Ceiling effects arise when test problems are insufficiently challenging; floor effects are the opposite, when problems are too challenging.
This is a problem in AI because we often use benchmark sets. But how do we detect the effect?
15
Ceiling Effects: Machine Learning
14 datasets from the UCI corpus of benchmarks, a mainstay of the ML community.
The problem is learning classification rules: each item is a vector of features plus a classification, and we measure the classification accuracy of each method (max 100%).
Compare C4 with 1R*, two competing algorithms:

DataSet:  BC    CH    GL    G2    HD    HE    …   Mean
C4:       72    99.2  63.2  74.3  73.6  81.2  ... 85.9
1R*:      72.5  69.2  56.4  77    78    85.1  ... 83.8
16
Ceiling Effects
DataSet:  BC    CH    GL    G2    HD    HE    …   Mean
C4:       72    99.2  63.2  74.3  73.6  81.2  ... 85.9
1R*:      72.5  69.2  56.4  77    78    85.1  ... 83.8
Max:      72.5  99.2  63.2  77    78    85.1  …   87.4

C4 achieves only about 2% better mean accuracy than 1R*, and even taking the better of C4 and 1R* on each dataset achieves only 87.4%.
So we have only weak evidence that C4 is better: both methods perform near the ceiling of what is practicable, and that ceiling effect means we cannot compare the two methods well. A sketch of this check appears below.
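A minimal sketch of that ceiling check, using only the six datasets visible in the table (the slide’s means are over all 14 datasets, so the numbers printed here differ): estimate the ceiling on each dataset as the best accuracy any method achieves, then see how far each method sits below it.

    # Accuracy columns: BC, CH, GL, G2, HD, HE (the six datasets shown above).
    accuracy = {
        "C4":  [72.0, 99.2, 63.2, 74.3, 73.6, 81.2],
        "1R*": [72.5, 69.2, 56.4, 77.0, 78.0, 85.1],
    }

    # Per-dataset ceiling estimate: the best accuracy any method achieves.
    best = [max(col) for col in zip(*accuracy.values())]

    for name, scores in accuracy.items():
        mean = sum(scores) / len(scores)
        gap = sum(b - s for b, s in zip(best, scores)) / len(scores)
        print("%-4s mean %.1f, mean gap below best %.1f" % (name, mean, gap))

    # Small gaps for both methods mean both sit near the estimated ceiling,
    # so the comparison between them carries little evidence.
    print("best-of-both mean %.1f" % (sum(best) / len(best)))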
17
Ceiling Effects
In fact 1R* uses only one feature (the best one), while C4 uses 6.6 features on average; the extra 5.6 features buy only about 2% improvement. Conclusion?
Either real-world learning problems are easy (use 1R*), or we need more challenging datasets.
Either way, we need to be aware of ceiling effects in results. A sketch of the 1R idea follows.
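For reference, a minimal sketch of the 1R idea (simplified; Holte’s actual 1R/1R* also handles numeric features and missing values): pick the single feature whose value-to-majority-class rule scores best on the training data.

    from collections import Counter, defaultdict

    def one_r(rows, labels):
        # rows: list of feature tuples; labels: parallel list of class labels.
        best = None
        for f in range(len(rows[0])):
            by_value = defaultdict(Counter)
            for row, label in zip(rows, labels):
                by_value[row[f]][label] += 1
            # Rule for feature f: map each value to its majority class.
            correct = sum(c.most_common(1)[0][1] for c in by_value.values())
            if best is None or correct > best[0]:
                rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
                best = (correct, f, rule)
        return best  # (training hits, chosen feature index, value -> class rule)

    rows = [("sunny", "hot"), ("rainy", "hot"), ("sunny", "cold"), ("rainy", "cold")]
    labels = ["out", "in", "out", "in"]
    print(one_r(rows, labels))  # picks feature 0: weather perfectly predicts the class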
18
Sampling Bias
Sampling bias is when data collection is biased against certain data, e.g. a teacher who says “Girls don’t answer maths questions”. Observation might suggest that indeed girls don’t answer many questions, but also that the teacher doesn’t ask them many questions.
Experienced AI researchers don’t do that, right?
19
Case Study: Phoenix
Phoenix = an AI system to fight (simulated) forest fires. Experiments suggested that wind speed was uncorrelated with the time to put out the fire, which is obviously incorrect (high winds spread forest fires).
Wind speed vs containment time (max 150 hours):
3: 120 55 79 10 140 26 15 110 12 54 10 103
6: 78 61 58 81 71 57 21 32 70
9: 62 48 21 55 101
What’s the problem?
20
Sampling bias in Phoenix
The cut-off at 150 hours introduces sampling bias: many high-wind fires get cut off, not many low-wind ones. On the remaining data there is essentially no correlation between wind speed and time (r = -0.053).
In fact, the data show that a lot of high-wind fires take more than 150 hours to contain, and those that don’t are similar to low-wind fires.
You wouldn’t do this, right? You might, if you had automated data analysis; the sketch below shows how easily it happens.
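A minimal simulation of the effect, under an invented model rather than Phoenix’s actual fire simulator: containment time is exponentially distributed with mean proportional to wind speed, so wind genuinely matters, and then every fire over 150 hours is dropped, as in the experiment.

    import random

    def pearson_r(xs, ys):
        # Pearson correlation coefficient of two equal-length sequences.
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx = sum((x - mx) ** 2 for x in xs)
        syy = sum((y - my) ** 2 for y in ys)
        return sxy / (sxx * syy) ** 0.5

    random.seed(0)
    # True effect built in: mean containment time is 50 hours per unit of wind speed.
    fires = [(w, random.expovariate(1.0 / (50 * w)))
             for _ in range(2000) for w in (3, 6, 9)]

    kept = [(w, t) for w, t in fires if t <= 150]  # the 150-hour cut-off

    print("r on all fires:      %+.2f" % pearson_r([w for w, _ in fires], [t for _, t in fires]))
    print("r on surviving data: %+.2f" % pearson_r([w for w, _ in kept], [t for _, t in kept]))
    for w in (3, 6, 9):
        frac = (sum(1 for v, t in kept if v == w)
                / sum(1 for v, _ in fires if v == w))
        print("wind %d: %.0f%% of fires survive the cut-off" % (w, 100 * frac))

On the full data the correlation is strongly positive; after the cut-off it drops close to zero, and far more high-wind than low-wind fires have been discarded, reproducing the pattern in the Phoenix data.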