Learning - Michigan State University (fcdyer/ZOL867/ZOL867Learning.pdf)
Learning
Learning Defined
• (Adaptive) change in behavior as a result of experience
  This view is neutral with respect to underlying mechanisms, whether in a forward- or reverse-engineering approach
• Acquisition of information/knowledge through experience/observation (leading to improved performance or decision-making)
  Here the focus is more explicitly on the mechanisms by which the information is encoded and used
Learning Topics
Topics
• History (beginning of the 20th century)
• Associative learning
  • What it explains
  • What it doesn't
  • Can S-R and representational accounts be reconciled?
• General-process vs. specialized learning mechanisms
• Active learning
• Learning, adaptation and evolution (Thursday)
• Natural and artificial systems (Next week)
A bit of history…
Learning and Behaviorism
• Behaviorism is both a philosophy and a method
• Behaviorism focuses on the observable relationship between experience (input) and behavior (output)
• Most behavior develops through associative learning (Pavlovian and operant conditioning)

Radical Behaviorism (Skinner)
• Only behavior (and antecedent causes in the environment) can be studied objectively; hypotheses about internal events are unscientific
• By extension, hypotheses about innate processes are unscientific (because antecedent causes prior to the birth of the animal can't be observed)
• Radical Behaviorism derives from Empiricism
A little more history…
The (slow) demise of Radical Behaviorism
• Chomsky: human language can't be accounted for by Behaviorist/Empiricist theories based only on the stimulus, response, and reinforcement experienced by the learner; we need to posit internal (innate?) grammatical rules that generate novel, never-experienced linguistic structures/knowledge
• AI (Newell and Simon)
• Ethology/Neurobiology: vindication of the "Nativist" view of behavioral development
Associative Learning

Definition: behavior changes as a result of experiencing an association between two events (E1 and E2)

Classical (Pavlovian) conditioning:
• E2 (intrinsically meaningful stimulus) leads to a reflexive response
• E1 (arbitrary stimulus) comes to trigger the response if experienced prior to E2

  E1     E2     Response
  Bell   Food   Salivation (unconditioned)
  Bell          Salivation (conditioned)

Instrumental/Operant/Trial-and-error conditioning:
• E2: an intrinsically meaningful stimulus (e.g. food or pain)
• E1: initially arbitrary action, strengthened or weakened when associated with E2

  E1 (action)   E2 (reinforcement)
  Press bar ->  Food

Operant conditioning exemplifies Thorndike's "Law of Effect"
Traditional Learning Theory (through the 1960s)
• Focus was on discovering general "laws of learning"
  • Order in which E1 and E2 must occur
  • Effects of salience
  • Effects of time delay
  • Effects of combining stimuli
  • Extinction of responses
• Avoided speculation about underlying mechanisms
• But the implication was that there were certain "general processes" involving strengthening of S-R connections
  • Complex behavior entailed chaining of S-R associations
  • "Knowledge" consists of chains of S-S associations
  • "Mind" is a huge look-up table
Transitions in Learning Theory
• Associative learning is not all-encompassing
  • Language
  • "Imitative" learning (e.g., bird song)
  • "Latent" learning (e.g., spatial exploration in the absence of reward)
• Even where associative learning operates, the General Process assumption crumbled
  • Different "laws" for different learned behaviors, e.g., the associability of different events varies according to the problem
  • Species differences in what can be learned and when
  • Dissociations in human and animal learning
Specializations: example from Animal Learning
Meadow voles: males are "polygynous" and have larger home ranges than females
Prairie voles: monogamous, and male and female home ranges are similar

In meadow voles, male spatial learning is better than that of females
In prairie voles, spatial learning is similar in the two sexes
Specializations: Ideas from Cognitive Psychology
Distinctions proposed for different systems of learning/memory
• Short-term vs. Long-term
• Explicit vs. Implicit (roughly, conscious vs. unconscious)
• Declarative ("knowing that") vs. Procedural ("knowing how")
• Instance-based vs. Rule-based
• Episodic vs. Semantic (re autobiographical knowledge of the past)
Episodic Memory in Animals?
Do scrub jays have episodic memory?
Clayton & Dickinson 1998
Where did I put that worm? And when did I put it there?
Endel Tulving (2001): Episodic memory entails conscious recall of autobiographical information, and the only evidence we have of conscious recall is a subject's verbal report
Associative Learning and the“Representational Theory of Mind”
S-R and Cognitive (representational) accounts are often pitted against one another. Can they be reconciled?

• Associative learning may be involved in the formation of complex representations (elements of experience bound together because of spatial-temporal contiguity)
• An internal representation of an important aspect of experience may function as input to an associative learning mechanism
Active Learning of Landmarks
Definition of landmark: any feature reliably associated with a goal
What makes a good landmark?
• For long-distance guidance: distant, low motion parallax (e.g., sun, distant mountains)
• For pinpointing a location: nearby, high motion parallax

Many animals, both vertebrate and invertebrate, have been shown to prefer nearby landmarks to learn locations: how do they figure out which ones are nearby?

How to identify a landmark (a big concern in robotics)?
• Static cues: color, contrast, symmetry, persistence over time
• Motion cues (to identify nearby features: "Turn Back and Look," TBL)
• Association with context or goals
Active Vision and Landmark Learning: Segmenting the scene into near and far elements
Flying insects actively examine the scene around a goal

Voss, R. (1995) Information durch Eigenbewegung: Rekonstruktion und Analyse des Bildflusses am Auge fliegender Insekten [Information from self-motion: reconstruction and analysis of image flow at the eye of flying insects]. Doctoral thesis, Universität Tübingen.

The flight path generates motion signals that may allow the insect to pick out nearby (and hence useful) landmarks against the background
A View From The Wasp’s “Cockpit”
[Figure: reconstructed scene and motion in the scene along the bee's path as she departs the food]

Eckert, M.P. & Zeil, J. (2001) Toward an ecology of motion vision. In: Motion Vision: Computational, Neural, and Ecological Constraints (eds. J.M. Zanker & J. Zeil). Springer Verlag, Berlin Heidelberg New York.
Computational Approaches to Learning
In Psychology:
• Motivated by the observation of "learning curves" showing a characteristic quantitative relationship between experience and performance
• Role of models:
  • Predict behavior/performance
  • Test hypotheses about mechanisms
Rescorla-Wagner
ΔV_CS = c (V_max − V_net)

• ΔV_CS: change in associative strength of the CS
• c: learning rate (a function of the salience or learnability of the US and CS)
• V_max: maximum associative strength (a function of the strength of the UR, or time lags between CS and US)
• V_net: current associative strength
• (V_max − V_net): measures the "surprisingness" of the event
Computational Approaches to Learning-cont’d
Rescorla-Wagner explains a lot
• Shapes of learning curves
• Extinction (waning of associative strength following non-rewarded trials)
• Responses to compound stimuli
  • Overshadowing:
    Train: [AX] -> US
    Test: X alone (weakened response to X by itself, which isn't as predictive as AX)
  • Blocking:
    Pretrain: A -> US
    Train: [AX] -> US
    Test: X alone (no response to X: it adds no predictive value)
• Many other things
• Integrating associative and representational approaches
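Blocking falls directly out of the Rescorla-Wagner rule, because every stimulus present on a trial shares one prediction error. A minimal sketch (stimulus names, learning rate, and trial counts are invented for illustration):

```python
def rw_trial(V, present, v_max, c=0.3):
    """One Rescorla-Wagner trial: every CS present shares the same
    prediction error c * (v_max - v_net), so stimuli compete for
    associative strength."""
    v_net = sum(V[s] for s in present)        # combined prediction of the US
    for s in present:
        V[s] += c * (v_max - v_net)           # delta-V_CS = c (V_max - V_net)
    return V

# Blocking: pretrain A alone, then train the compound [A, X].
V = {"A": 0.0, "X": 0.0}
for _ in range(50):
    rw_trial(V, ["A"], v_max=1.0)             # A alone comes to predict the US
for _ in range(50):
    rw_trial(V, ["A", "X"], v_max=1.0)        # compound trials: error is ~0
print(round(V["A"], 2), round(V["X"], 2))     # A near 1.0, X blocked near 0.0
```

Because A already predicts the US perfectly after pretraining, the error term is near zero during compound training, so X never gains strength.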
Computational Approaches to Learning-cont’d
Rescorla-Wagner also fails to explain some things
• Recovery from extinction (spontaneous or stimulus-triggered)
• Second-order conditioning:
  Train: CS1 -> US
  Train: CS2 -> CS1
  Rescorla-Wagner can't explain this, because CS2 never gets paired with the original reinforcer
Computational Approaches to Learning
In Computer Science/AI:
• Learning is considered a good solution to certain problems
  • Classifying complex patterns
  • Making predictions given uncertainty in the environment
  • Conferring autonomy on devices
• Making machines that learn entails a clear specification of the problem
  • State space
  • Action space
  • Performance metric for evaluating different responses ("utility function")
• Goal: compute an optimal "policy" (mapping from state to action)
Reinforcement Learning
Defines a class of problems:
• The agent learns from its own experience in the environment, rather than from supervised teaching
• High degree of uncertainty (resulting from environmental unpredictability)
• The animal improves based upon receipt of "rewards," but may not be rewarded until a sequence of actions has been completed
  • Thus, there is a problem of assigning credit to actions that have worked
• There is also a problem of generalizing to new situations
• There is also a problem of learning about features of the environment not yet experienced
A Grid-World (Spatial Cognition) Example
[Figure: a 4×3 grid world (columns 1-4, rows 1-3) containing an obstacle, two terminal states with payoffs +1 and -1, and an immediate payoff of -0.04 in every other state. Two sample trajectories are shown.]

Total payoff: Trial 1: -1.16; Trial 2: +0.80
While wandering through the environment, the agent may experience total payoffs that are very bad or very good.

• Numbers show the immediate payoff in each state.
• Arrows show an "optimal policy," i.e., one that will give the maximum total payoff if followed from each state.
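That maximum-total-payoff policy can be computed by standard value iteration. A minimal sketch, assuming deterministic moves and the usual placement of the obstacle at (2,2) and the +1/-1 terminals at (4,3)/(4,2); these coordinates are my reading of the figure, not stated in the text:

```python
# Value iteration on the 4x3 grid world described above.
# Assumptions: deterministic moves, obstacle at (2,2),
# terminals +1 at (4,3) and -1 at (4,2), payoff -0.04 elsewhere.
GRID = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERMINAL = {(4, 3): 1.0, (4, 2): -1.0}
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def step(s, a):
    """Deterministic transition; bumping into a wall or the obstacle
    leaves the agent where it is."""
    s2 = (s[0] + a[0], s[1] + a[1])
    return s2 if s2 in GRID else s

U = {s: 0.0 for s in GRID}
for _ in range(50):                            # sweep until values settle
    for s in GRID:
        if s in TERMINAL:
            U[s] = TERMINAL[s]
        else:
            U[s] = max(-0.04 + U[step(s, a)] for a in MOVES)

print(round(U[(1, 1)], 3))                     # five -0.04 steps, then +1
```

Under these assumptions the start state's value comes out at 0.8, matching the slide's best observed trial payoff of +0.80, and acting greedily with respect to U reproduces the arrowed optimal policy.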
Temporal Difference Methods
Definition: a set of algorithms whereby an agent can be programmed to predict, in each state, the reward received in the next state (or upon taking a given action)
• TD methods are closely related to the Rescorla-Wagner model
• They can deal with second-order conditioning, whereby a state gains value by an indirect (sequential) association with a reward state
• Goal: learn the true value of each state, assuming the optimal policy is followed
Temporal Difference Methods
The objective is to learn an estimate of the utility of all states. The utility is the expected sum of rewards from this state on.

Key idea: use an insight from dynamic programming to adjust the utility of a state based on the immediate reward and the utility of the next state:

U_{t+1}(s) ← U_t(s) + α (r_t(s) + γ U_t(s′) − U_t(s))

where α is the learning rate, r_t(s) is the reward obtained in state s, and s′ is the observed successor state.

U(s) is an estimate of V*(s), the maximum discounted cumulative reward starting in state s.
Temporal Difference Methods
U_{t+1}(s) ← U_t(s) + α (r_t(s) + γ U_t(s′) − U_t(s))

The bracketed term (r_t(s) + γ U_t(s′) − U_t(s)) represents the "TD error signal."

Example: updating the value of state 34 over three trials, assuming initial U_0 = 0, α = 0.9, γ = 0.9:

U_1 = 0 + 0.9(−0.04 + 0.9(1.0) − 0) = 0.774
  [Then state 34 can be used to update adjacent states (33 and 23)]
U_2 = 0.774 + 0.9(−0.04 + 0.9(1.0) − 0.774) = 0.851
U_3 = 0.851 + 0.9(−0.04 + 0.9(1.0) − 0.851) = 0.859

As the value/utility function is learned, a policy can be developed: for example, from each state, move to the state which has the highest value.
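The three updates in the worked example can be reproduced directly from the rule. A minimal sketch, treating the successor's utility as fixed at 1.0 (the +1 terminal):

```python
def td_update(u_s, r, u_next, alpha=0.9, gamma=0.9):
    """TD(0) update: U(s) <- U(s) + alpha * (r + gamma * U(s') - U(s)).
    The bracketed term is the TD error signal."""
    return u_s + alpha * (r + gamma * u_next - u_s)

# The slide's numbers: U0 = 0, step reward -0.04, successor utility 1.0.
u = 0.0
for trial in range(3):
    u = td_update(u, r=-0.04, u_next=1.0)
    print(round(u, 3))            # 0.774, then 0.851, then 0.859
# u converges toward the true value -0.04 + 0.9 * 1.0 = 0.86
```

Each pass shrinks the TD error by a factor of (1 − α), which is why the estimates close in on 0.86 so quickly.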
Practical Applications of TD methods
• Robotic navigation/motor control
• Elevator scheduling
• Backgammon (TD-Gammon)

...But does it have any relevance to biology?
• As a model of prediction-learning in n-armed bandit problems
• Otherwise it has limitations
Hebbian Learning
An account of learning at the cellular level: if two neurons are connected and active at the same time, some change occurs to strengthen the connection.

Proposed by Donald O. Hebb as early as 1949.

This provides a possible mechanism by which a given outcome could be "predicted" upon the occurrence of a given input.

But what kind of system could differentiate among predictors?
[Figure: a Hebbian network with "Sound" and "Smell" inputs; source: http://www.qub.ac.uk/mgt/intsys/nnbiol.html]
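A toy version of Hebb's rule using the figure's sound and smell inputs. The multiplicative form Δw = η · pre · post is an assumed textbook rendering, not taken from the slide:

```python
def hebb_step(w, pre, post, eta=0.1):
    """Hebb's rule: strengthen a connection whenever the pre- and
    postsynaptic units are active together."""
    return w + eta * pre * post

w_sound, w_smell = 0.0, 0.0
for _ in range(10):
    # The sound input co-occurs with postsynaptic firing; smell does not.
    w_sound = hebb_step(w_sound, pre=1.0, post=1.0)
    w_smell = hebb_step(w_smell, pre=0.0, post=1.0)
print(w_sound > w_smell)   # the co-active (sound) synapse has strengthened
```

Note that the rule strengthens any co-active connection indiscriminately, which is exactly why the slide asks what kind of system could differentiate among predictors.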
Error-Prediction (bee model)
[Figure: a network model of bee choice. Visual input from blue (B) and yellow (Y) flowers, weighted by w^B and w^Y, generates a prediction that is compared with the nectar reward r(t) to produce an error signal δ(t), which drives the action (choose Y or B). Source: http://psycserv.mcmaster.ca/~smitha/PAGE_RESEARCH/summary.html]

δ(t) = γ(r_t + V_t) − (r_{t−1} + V_{t−1})
Δw_t = α x_t δ_t
V_t = w_t · x_t = w_t^B x_t^B + w_t^Y x_t^Y
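A single-visit evaluation of the error signal above can make the quantities concrete. All numbers here (rewards, weights, γ, α) are invented for illustration:

```python
gamma, alpha = 0.9, 0.1
r_t, r_prev = 1.0, 0.0          # nectar found on this visit, none before
x_B, x_Y = 1.0, 0.0             # a blue flower is in view
w_B, w_Y = 0.2, 0.5             # current weights
V_t = w_B * x_B + w_Y * x_Y     # V(t) = w_B x_B + w_Y x_Y = 0.2
V_prev = 0.5                    # value predicted on the previous step
delta = gamma * (r_t + V_t) - (r_prev + V_prev)   # delta(t): positive surprise
dw_B = alpha * x_B * delta      # only the active (blue) weight changes
print(round(delta, 3), round(dw_B, 3))
```

Because the reward exceeded the previous prediction, δ is positive (0.58 with these numbers) and the weight on the active blue input grows, while x_Y = 0 leaves the yellow weight untouched.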
Limitations of TD learning as model for animals
• The state space has to be learned along with the policy, and learning of the state space is not part of the TD approach
• Some success so far, however, for models of n-armed bandit problems, where the state space maps neatly onto perceptual categories
  • These have the potential to scale up to deal with second-order conditioning
• For more complex problems (when the state space is large), theoretical TD algorithms take a long time to converge on the optimum
• In general, animals seem to be faster at solving the "credit-assignment problem"
Credit assignment:figuring out what to learn and what to ignore
A common learning problem is to predict the occurrence of an event, so as to prepare for it physiologically or behaviorally.

In a complex environment, many possible cues or behaviors might be correlated with a given outcome. The problem is to "assign credit" to the right antecedent event, so that reliable predictions can be made in the future.

• TD methods build up credit assignment incrementally, through learned linkages between states
• Is this what animals do?
Active learning and the credit-assignment problem
Insects "turn on" learning of landmarks when the landmarks are guaranteed to be useful

Orientation flights at food ("Turn Back and Look"):
• Done on departure, after receiving a reward
• Guarantees learning of landmarks that will be useful on return
• The learning flight is modulated according to the need for information (Cindy Wei)

Orientation flights at the nest (E. Capaldi):
• Young bees: first flights are learning flights
• After moving to a new nest: ditto
[Figure: departure durations (sec) plotted against visit number (1-18), showing actual departure durations and a nonlinear regression; a 9-min delay separates the initial phase from the post-delay phase]
Contextual cues as another solution to the credit-assignment problem

Landmarks along a route: "gated" by path-integration (PI) info?
• To learn landmarks associated with the homeward path, "turn on" learning only when the path integrator indicates you are heading homeward
• Ditto for outbound landmarks when heading to food

Schatz, B., Chameron, S., Beugnon, G. & Collett, T.S. (1999) The use of path integration to guide route learning in ants. Nature 399, 769-772.

• Ants head to food on a straight path
• They are required to find their way home along a hairpin path, with a series of choice points
• They correctly learn those decisions encountered when aligned with the home vector
[Figure: cumulative no. correct choices plotted against cumulative no. choices]