Transcript of Kansas State University CIS 732: Machine Learning and Pattern Recognition, Lecture 16 of 42: Regression and Prediction (Wednesday, 27 February 2008).

Page 1:

Kansas State University

Department of Computing and Information Sciences
CIS 732: Machine Learning and Pattern Recognition

Wednesday, 27 February 2008

William H. Hsu

Department of Computing and Information Sciences, KSU
http://www.kddresearch.org

http://www.cis.ksu.edu/~bhsu

Readings:

Section 6.11, Han & Kamber 2e

Chapter 1, Sections 6.1-6.5, Goldberg

Sections 9.1-9.4, Mitchell

Regression and Prediction

Lecture 16 of 42

Page 2: Lecture Outline

• Readings

– Section 6.11, Han & Kamber 2e

– Suggested: Chapter 1, Sections 6.1-6.5, Goldberg

• Paper Review: “Genetic Algorithms and Classifier Systems”, Booker et al.

• Evolutionary Computation

– Biological motivation: process of natural selection

– Framework for search, optimization, and learning

• Prototypical (Simple) Genetic Algorithm

– Components: selection, crossover, mutation

– Representing hypotheses as individuals in GAs

• An Example: GA-Based Inductive Learning (GABIL)

• GA Building Blocks (aka Schemas)

• Taking Stock (Course Review): Where We Are, Where We’re Going

Page 3:

• Working with relationships between two variables
  – Size of Teaching Tip & Stats Test Score

[Scatterplot: Stats Test Score (y-axis, 0–100) vs. size of teaching tip (x-axis, $0–$80)]

© 2005 J. Sinn, Winthrop University

Page 4: Correlation & Regression

• Univariate & Bivariate Statistics
  – Univariate: frequency distribution, mean, mode, range, standard deviation
  – Bivariate: correlation between two variables

• Correlation
  – linear pattern of relationship between one variable (x) and another variable (y)
  – an association between two variables
  – relative position on one variable correlates with relative position on the other variable
  – graphical representation of the relationship between two variables

• Warning:
  – No proof of causality
  – Cannot assume x causes y


Page 5: Scatterplot!

• No Correlation
  – random or circular assortment of dots

• Positive Correlation
  – ellipse leaning to the right
  – GPA and SAT
  – smoking and lung damage

• Negative Correlation
  – ellipse leaning to the left
  – depression & self-esteem
  – studying & test errors


Page 6: Pearson’s Correlation Coefficient

• “r” indicates…
  – strength of relationship (strong, weak, or none)
  – direction of relationship
    • positive (direct): variables move in the same direction
    • negative (inverse): variables move in opposite directions

• r ranges in value from –1.0 to +1.0

  –1.0 (Strong Negative)   0.0 (No Rel.)   +1.0 (Strong Positive)

• Go to website! – playing with scatterplots
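The slides compute r with SPSS; as a minimal sketch of the underlying formula, r = Σ(x−x̄)(y−ȳ) / √(Σ(x−x̄)² · Σ(y−ȳ)²), here is a small Python version (the tip/score values below are hypothetical, not the worksheet data):

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient r for paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical "size of teaching tip" ($) vs. "stats test score" pairs:
tips = [5, 10, 20, 40, 60, 80]
scores = [55, 60, 70, 75, 85, 95]
print(round(pearson_r(tips, scores), 3))  # close to +1.0: strong positive
```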


Page 7: Practice with Scatterplots

r = .__ __

r = .__ __

r = .__ __

r = .__ __

Page 8:

[Scatterplot examples shown as images; no recoverable text.]

Page 9: Correlation Guestimation

Page 10: Correlations

                         Miles walked
                         per day       Weight     Depression   Anxiety
Miles walked per day
  Pearson Correlation     1            -.797**    -.800**      -.774**
  Sig. (2-tailed)         .             .002       .002         .003
  N                       12            12         12           12
Weight
  Pearson Correlation    -.797**        1          .648*        .780**
  Sig. (2-tailed)         .002          .          .023         .003
  N                       12            12         12           12
Depression
  Pearson Correlation    -.800**        .648*      1            .753**
  Sig. (2-tailed)         .002          .023       .            .005
  N                       12            12         12           12
Anxiety
  Pearson Correlation    -.774**        .780**     .753**       1
  Sig. (2-tailed)         .003          .003       .005         .
  N                       12            12         12           12

**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).


Page 11: Samples vs. Populations

• Sample statistics estimate population parameters
  – M tries to estimate μ
  – r tries to estimate ρ (“rho”, the Greek letter, not “p”)

• r = correlation for a sample, based on the limited observations we have

• ρ = actual correlation in the population, the true correlation

• Beware sampling error!!
  – even if ρ = 0 (there’s no actual correlation), you might get r = .08 or r = –.26 just by chance
  – we look at r, but we want to know about ρ


Page 12: Hypothesis testing with Correlations

• Two possibilities
  – H0: ρ = 0 (no actual correlation; the null hypothesis)
  – Ha: ρ ≠ 0 (there is some correlation; the alternative hypothesis)

• Case #1 (see correlation worksheet)
  – correlation between distance and points: r = –.904
  – sample is small (n = 6), but r is very large
  – we guess ρ < 0 (we guess there is some correlation in the population)

• Case #2
  – correlation between aiming and points: r = .628
  – sample is small (n = 6), and r is only moderate in size
  – we guess ρ = 0 (we guess there is NO correlation in the population)

• Bottom line
  – we can only guess about ρ
  – we can be wrong in two ways
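The slides take these p-values from SPSS; as a sketch of what SPSS computes, the test statistic for H0: ρ = 0 is t = r√(n−2)/√(1−r²) with n−2 degrees of freedom. The following code is illustrative, not from the lecture:

```python
from scipy import stats

def r_significance(r, n):
    """Two-tailed p-value for H0: rho = 0, using t = r*sqrt(n-2)/sqrt(1-r^2)."""
    df = n - 2
    t = r * df ** 0.5 / (1 - r * r) ** 0.5
    return df, t, 2 * stats.t.sf(abs(t), df)

for label, r in [("distance vs. points", -0.904), ("aiming vs. points", 0.628)]:
    df, t, p = r_significance(r, n=6)
    print(f"{label}: t({df}) = {t:.2f}, p = {p:.3f}")
# distance vs. points: p ≈ .013 -> reject H0 at the .05 level
# aiming vs. points:   p ≈ .18  -> fail to reject H0
```

Reassuringly, these match the Sig. values in the Page 13 matrix.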


Page 13: Reading Correlation Matrix

Correlations (a)

Pearson correlations, with Sig. (2-tailed) in parentheses; N = 6 for every pair:

                          Toss      Distance  Spin      Aiming    Dexterity  GPA       Confidence
Total ball toss points     1        -.904*    -.582      .628      .821*     -.037     -.502
                           (.)      (.013)    (.226)    (.181)    (.045)     (.945)    (.310)
Distance from target      -.904*     1         .279     -.653     -.883*      .228      .522
                           (.013)   (.)       (.592)    (.159)    (.020)     (.664)    (.288)
Time spun before          -.582      .279      1        -.390     -.248      -.087      .267
throwing                   (.226)   (.592)    (.)       (.445)    (.635)     (.869)    (.609)
Aiming accuracy            .628     -.653     -.390      1         .758      -.546     -.250
                           (.181)   (.159)    (.445)    (.)       (.081)     (.262)    (.633)
Manual dexterity           .821*    -.883*    -.248      .758      1         -.553     -.101
                           (.045)   (.020)    (.635)    (.081)    (.)        (.255)    (.848)
College grade point avg   -.037      .228     -.087     -.546     -.553       1        -.524
                           (.945)   (.664)    (.869)    (.262)    (.255)     (.)       (.286)
Confidence for task       -.502      .522      .267     -.250     -.101      -.524      1
                           (.310)   (.288)    (.609)    (.633)    (.848)     (.286)    (.)

*. Correlation is significant at the 0.05 level (2-tailed).
a. Day sample collected = Tuesday

Reading one cell: r = –.904, p = .013. The p-value is the probability of getting a correlation this size by sheer chance; reject H0 if p ≤ .05. Reported as: r(4) = –.904, p < .05 (the number in parentheses reflects the sample size, n – 2).

Page 14: Predictive Potential

• Coefficient of Determination
  – r²
  – amount of variance accounted for in y by x
  – percentage increase in accuracy you gain by using the regression line to make predictions
  – without correlation, you can only guess the mean of y
  – [used with regression]

[Scale graphic: 0% – 20% – 40% – 60% – 80% – 100%]
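As a one-line illustration (the r value is the one that appears in the SPSS Model Summary on Page 20):

```python
r = -0.777                    # distance vs. total ball toss points
print(f"r^2 = {r ** 2:.3f}")  # 0.604 -> about 60% of the variance in y is
                              # accounted for by x (SPSS reports .603,
                              # computed from the unrounded r)
```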


Page 15: Limitations of Correlation

• linearity
  – can’t describe non-linear relationships
  – e.g., relation between anxiety & performance

• truncation of range
  – underestimates strength of relationship if you can’t see the full range of x values

• no proof of causation
  – third variable problem: a third variable could be causing change in both variables
  – directionality: can’t be sure which way causality “flows”


Page 16: Regression

• Regression: Correlation + Prediction
  – predicting y based on x
  – e.g., predicting throwing points (y) based on distance from target (x)

• Regression equation: formula that specifies a line
  – y’ = bx + a
  – plug in an x value (distance from target) and predict y (points)
  – note: y = actual value of a score; y’ = predicted value

• Go to website! – Regression Playground

Page 17: Regression Graphic – Regression Line

[Scatterplot with fitted line: Total ball toss points (y-axis, 0–120) vs. Distance from target (x-axis, 8–26); Rsq = 0.6031]

if x = 18 then… y’ = 47
if x = 24 then… y’ = 20

See correlation & regression worksheet


Page 18: Regression Equation

• y’ = bx + a
  – y’ = predicted value of y
  – b = slope of the line
  – x = value of x that you plug in
  – a = y-intercept (where the line crosses the y-axis)

• In this case: y’ = –4.263(x) + 125.401

• So if the distance is 20 feet:
  – y’ = –4.263(20) + 125.401
  – y’ = –85.26 + 125.401
  – y’ = 40.141

See correlation & regression worksheet
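The same plug-and-predict step, as a minimal Python sketch (the function name is mine; the coefficients are the fitted values from the slide):

```python
def predict_points(distance_ft, b=-4.263, a=125.401):
    """Regression prediction y' = b*x + a for the ball toss data."""
    return b * distance_ft + a

print(round(predict_points(20), 3))  # 40.141, matching the hand calculation
```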


Page 19: SPSS Regression Set-up

• “Criterion”: the y-axis variable, what you’re trying to predict

• “Predictor”: the x-axis variable, what you’re basing the prediction on

• Note: Never refer to the IV or DV when doing regression


Page 20: Getting Regression Info from SPSS

Model Summary

  Model 1:  R = .777 (a)   R Square = .603   Adjusted R Square = .581   Std. Error of the Estimate = 18.476
  a. Predictors: (Constant), Distance from target

Coefficients (a)

                          Unstandardized Coefficients   Standardized Coefficients
  Model 1                 B          Std. Error         Beta                         t        Sig.
  (Constant)              125.401    14.265                                          8.791    .000
  Distance from target    -4.263     .815               -.777                       -5.230    .000
  a. Dependent Variable: Total ball toss points

y’ = b(x) + a
y’ = –4.263(20) + 125.401   (b = –4.263, a = 125.401)

See correlation & regression worksheet

Page 21: Predictive Ability

• Mantra!!
  – As variability decreases, prediction accuracy ___
  – if we can account for variance, we can make better predictions

• As r increases:
  – r² increases
    • “variance accounted for” increases
    • the prediction accuracy increases
  – prediction error decreases (distance between y’ and y)
  – Sy’ decreases
    • the standard error of the residual/predictor
    • measures overall amount of prediction error

• We like big r’s!!!


Page 22: Drawing a Regression Line by Hand

Three steps (see the sketch below):

1. Plug zero in for x to get a y’ value, and plot this point
   – Note: it will be the y-intercept

2. Plug in a large value for x (just so it falls at the right end of the graph) and plot the resulting point

3. Connect the two points with a straight line!
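The same two-point construction in code, using matplotlib and the coefficients fitted on Page 20 (the choice of x = 30 for the right-hand point is arbitrary):

```python
import matplotlib.pyplot as plt

b, a = -4.263, 125.401               # slope and y-intercept from the SPSS output

x_pts = [0, 30]                      # step 1: x = 0 gives the y-intercept;
y_pts = [b * x + a for x in x_pts]   # step 2: a large x for the right end

plt.plot(x_pts, y_pts)               # step 3: connect with a straight line
plt.xlabel("Distance from target")
plt.ylabel("Total ball toss points")
plt.show()
```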


Page 23: Time Series Prediction: Forecasting the Future and Understanding the Past

Santa Fe Institute Proceedings on the Studies in the Sciences of Complexity
Edited by Andreas Weigend and Neil Gershenfeld

NIST Complex System Program: Perspectives on Standard Benchmark Data in Quantifying Complex Systems
Vincent Stanford, Complex Systems Test Bed project
August 31, 2007

Page 24: Chaos in Nature, Theory, and Technology

[Images: Rings of Saturn; Lorenz attractor; aircraft dynamics at high angles of attack]

Page 25: Time Series Prediction: A Santa Fe Institute competition using standard data sets

• Santa Fe Institute (SFI) founded in 1984 to “… focus the tools of traditional scientific disciplines and emerging computer resources on … the multidisciplinary study of complex systems…”

• “This book is the result of an unsuccessful joke. … Out of frustration with the fragmented and anecdotal literature, we made what we thought was a humorous suggestion: run a competition. …no one laughed.”

• Time series from physics, biology, economics, …, beg the same questions:
  – What happens next?
  – What kind of system produced this time series?
  – How much can we learn about the producing system?

• Quantitative answers can permit direct comparisons

• Make some standard data sets in consultation with subject matter experts in a variety of areas

• Very NISTY; but we are in a much better position to do this in the age of Google and the Internet

Page 26: Selecting benchmark data sets for inclusion in the book

• Subject matter expert advisor group:
  – Biology
  – Economics
  – Astrophysics
  – Numerical Analysis
  – Statistics
  – Dynamical Systems
  – Experimental Physics

Page 27: The Data Sets

A. Far-infrared laser excitation

B. Sleep Apnea

C. Currency exchange rates

D. Particle driven in nonlinear multiple well potentials

E. Variable star data

F. J. S. Bach fugue notes

Page 28: J. S. Bach benchmark

• Dynamic, yes.
• But is it an iterative map?
• Is it amenable to time delay embedding?

Page 29: Competition Tasks

• Predict the withheld continuations of the data sets provided for training, and measure errors

• Characterize the systems as to:
  – degrees of freedom
  – predictability
  – noise characteristics
  – nonlinearity of the system

• Infer a model for the governing equations

• Describe the algorithms employed

Page 30: Complex Time Series Benchmark Taxonomy

Each benchmark can be placed along these dichotomies:

• Natural vs. Synthetic
• Stationary vs. Nonstationary
• Low dimensional vs. Stochastic
• Clean vs. Noisy
• Short vs. Long
• Documented vs. Blind
• Linear vs. Nonlinear
• Scalar vs. Vector
• One trial vs. Many trials
• Continuous vs. Discontinuous (switching: catastrophes, episodes)

Page 31: Time-honored linear models

• Auto-Regressive Moving Average (ARMA)
• Many linear estimation techniques based on Least Squares or Least Mean Squares
• Power spectra and autocorrelation characterize such linear systems
• Randomness comes only from the forcing function x(t)

$$y[t+1] = \sum_{i=0}^{N_{AR}} a_i\, y[t-i] + \sum_{j=0}^{N_{MA}} b_j\, x[t-j]$$
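A minimal simulation of this model (the coefficient values below are arbitrary, chosen only to keep the AR part stable):

```python
import random

def simulate_arma(a, b, steps, seed=0):
    """Simulate y[t+1] = sum_i a[i]*y[t-i] + sum_j b[j]*x[t-j],
    where the white-noise forcing x(t) is the only source of randomness."""
    rng = random.Random(seed)
    n = max(len(a), len(b))             # history needed for the lagged terms
    y = [0.0] * n
    x = [rng.gauss(0, 1) for _ in range(n)]
    for t in range(n - 1, n - 1 + steps):
        x.append(rng.gauss(0, 1))
        ar = sum(ai * y[t - i] for i, ai in enumerate(a))
        ma = sum(bj * x[t - j] for j, bj in enumerate(b))
        y.append(ar + ma)
    return y[n:]

print(simulate_arma(a=[0.6, -0.2], b=[1.0, 0.5], steps=5))
```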

Page 32: Simple nonlinear systems can exhibit chaotic behavior

• Spectrum and autocorrelation characterize linear systems, not these

• Deterministic chaos looks random to linear analysis methods

• The logistic map is an early example (Ulam 1957):

$$x[t+1] = r\, x[t](1 - x[t])$$

Logistic map, 2.9 < r < 3.99
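Iterating the map makes the point concrete; in the chaotic regime, two nearby starting values diverge within a few steps (a small sketch, with r = 3.9 chosen from the range above):

```python
def logistic_map(r, x0, steps):
    """Iterate x[t+1] = r * x[t] * (1 - x[t])."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

# Sensitive dependence on initial conditions:
for x0 in (0.400, 0.401):
    print([round(x, 3) for x in logistic_map(r=3.9, x0=x0, steps=8)])
```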

Page 33: Understanding and learning: comments from SFI

• Weak to strong models: many parameters to few
• Data poor to data rich
• Theory poor to theory rich
• Weak models progress to strong, e.g., planetary motion:
  – Tycho Brahe: observes and records raw data
  – Kepler: equal areas swept in equal time
  – Newton: universal gravitation, mechanics, and calculus
  – Poincaré: fails to solve the three-body problem
  – Sussman and Wisdom: chaos ensues with a computational solution!

• Is that a simplification?

Page 34: Discovering properties of data and inferring (complex) models

• Can’t decompose an output into the product of input and transfer function, Y(z) = H(z)X(z), by doing a Z, Laplace, or Fourier transform

• Linear perceptrons were shown to have severe limitations by Minsky and Papert

• Perceptrons with non-linear threshold logic can solve XOR and many classifications not available with the linear version

• But according to SFI: “Learning XOR is as interesting as memorizing the phone book. More interesting - and more realistic - are real-world problems, such as prediction of financial data.”

• Many approaches are investigated

Page 35: Time delay embedding: differs from traditional experimental measurements

– Provides detailed information about degrees of freedom beyond the scalar measured

– Rests on probabilistic assumptions, though not guaranteed to be valid for any particular system

– Reconstructed dynamics are seen through an unknown “smooth transformation”

– Therefore allows precise questions only about invariants under “smooth transformations”

– It can still be used for forecasting a time series and “characterizing essential features of the dynamics that produced it”

Page 36: Time delay embedding theorems

“The most important Phase Space Reconstruction technique is the method of delays”

– Assume the dynamics f(X) on a V-dimensional manifold has a strange attractor A with box counting dimension d_A

– s(X) is a twice differentiable scalar measurement giving {s_n} = {s(X_n)}

– M is called the embedding dimension; τ is generally referred to as the delay, or lag

– Embedding theorems: if {s_n} consists of scalar measurements of the state of a dynamical system then, under suitable hypotheses, the time delay embedding {S_n} is a one-to-one transformed image of the {X_n}, provided M > 2 d_A (e.g., Takens 1981, Lecture Notes in Mathematics, Springer-Verlag; or Sauer and Yorke, J. of Statistical Physics, 1991)

Vector sequence:      $x_{n+1} = f(x_n)$
Scalar measurement:   $\{s_n\} = \{s(x_n)\}$
Time delay vectors:   $S_n = (s_{n-(M-1)\tau},\, s_{n-(M-2)\tau},\, \ldots,\, s_n)$
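Constructing the delay vectors takes only a few lines; a minimal sketch (the series values are arbitrary):

```python
def delay_embedding(s, M, tau):
    """Time delay vectors S_n = (s[n-(M-1)*tau], ..., s[n-tau], s[n])."""
    start = (M - 1) * tau
    return [tuple(s[n - (M - 1 - k) * tau] for k in range(M))
            for n in range(start, len(s))]

series = [0.1, 0.5, 0.9, 0.2, 0.7, 0.3, 0.8, 0.4]
for vec in delay_embedding(series, M=3, tau=2):
    print(vec)  # first vector: (s[0], s[2], s[4]) = (0.1, 0.9, 0.7)
```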

Page 37: Time series prediction: many different techniques thrown at the data to “see if anything sticks”

Examples:

– Delay coordinate embedding: short-term prediction by filtered delay coordinates and reconstruction with local linear models of the attractor (T. Sauer)

– Neural networks with internal delay lines: performed well on data set A (E. Wan), (M. Mozer)

– Simple architectures for fast machines: “Know the data and your modeling technique” (X. Zhang and J. Hutchinson)

– Forecasting pdfs using HMMs with mixed states: capturing “Embedology” (A. Fraser and A. Dimitriadis)

– More…

Page 38: Time series characterization: many different techniques thrown at the data to “see if anything sticks”

Examples:

– Stochastic and deterministic modeling: local linear approximation to attractors (M. Casdagli and A. Weigend)

– Estimating dimension and choosing time delays: box counting (F. Pineda and J. Sommerer)

– Quantifying chaos using information-theoretic functionals: mutual information and nonlinearity testing (M. Palus)

– Statistics for detecting deterministic dynamics: coarse-grained flow averages (D. Kaplan)

– More…

Page 39: What to make of this? A handbook for the corpus-driven study of nonlinear dynamics

Very NISTY:

– Convene a panel of leading researchers

– Identify areas of interest where improved characterization and predictive measurements can be of assistance to the community

– Identify standard reference data sets:
  • development corpora
  • test sets

– Develop metrics for prediction and characterization

– Evaluate participants

– Is there a sponsor?

– Are there areas of special importance to communities we know? For example: predicting catastrophic failures of machines from sensors.

Page 40: Ideas?

Page 41: Terminology

• Evolutionary Computation (EC): models based on natural selection

• Genetic Algorithm (GA) Concepts
  – Individual: single entity of the model (corresponds to a hypothesis)
  – Population: collection of entities in competition for survival
  – Generation: single application of the selection and crossover operations
  – Schema, aka building block: descriptor of a GA population (e.g., 10**0*)
  – Schema theorem: representation of a schema grows in proportion to its relative fitness

• Simple Genetic Algorithm (SGA) Steps (see the sketch after this list)
  – Selection
    • Proportionate reproduction (aka roulette wheel): P(individual) ∝ f(individual)
    • Tournament: let individuals compete in pairs or tuples; eliminate unfit ones
  – Crossover
    • Single-point: 11101001000, 00001010101 → { 11101010101, 00001001000 }
    • Two-point: 11101001000, 00001010101 → { 11001011000, 00101000101 }
    • Uniform: 11101001000, 00001010101 → { 10001000100, 01101011001 }
  – Mutation: single-point (“bit flip”), multi-point
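A minimal SGA sketch tying the three operators together (the one-max fitness function and all parameter values are stand-ins for illustration, not the GABIL setup):

```python
import random

rng = random.Random(42)

def roulette_select(pop, fitness):
    """Proportionate reproduction: P(individual) proportional to f(individual)."""
    total = sum(fitness(ind) for ind in pop)
    pick = rng.uniform(0, total)
    acc = 0.0
    for ind in pop:
        acc += fitness(ind)
        if acc >= pick:
            return ind
    return pop[-1]

def single_point_crossover(p1, p2):
    """Swap the tails of two bit strings at a random cut point."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(ind, p_flip=0.01):
    """Single-point ('bit flip') mutation, applied independently to each bit."""
    return "".join("10"[int(b)] if rng.random() < p_flip else b for b in ind)

fitness = lambda ind: ind.count("1")      # toy "one-max" fitness
pop = ["11101001000", "00001010101", "10101010101", "00000000000"]
for generation in range(20):
    parents = [roulette_select(pop, fitness) for _ in range(len(pop))]
    pop = []
    for a, b in zip(parents[::2], parents[1::2]):
        for child in single_point_crossover(a, b):
            pop.append(mutate(child))
print(max(pop, key=fitness))              # converges toward all ones
```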

Page 42: Summary Points

• Evolutionary Computation

– Motivation: process of natural selection

• Limited population; individuals compete for membership

• A method for parallel, stochastic search

– Framework for problem solving: search, optimization, learning

• Prototypical (Simple) Genetic Algorithm (GA)

– Steps

• Selection: reproduce individuals probabilistically, in proportion to fitness

• Crossover: generate new individuals probabilistically, from pairs of “parents”

• Mutation: modify structure of individual randomly

– How to represent hypotheses as individuals in GAs

• An Example: GA-Based Inductive Learning (GABIL)

• Schema Theorem: Propagation of Building Blocks

• Next Lecture: Genetic Programming, The Movie