Learning - Michigan State University (fcdyer/ZOL867/ZOL867Learning.pdf)
Learning
Learning Defined
• (Adaptive) change in behavior as a result of experience
  This view is neutral with respect to underlying mechanisms, whether in a forward- or reverse-engineering approach
• Acquisition of information/knowledge through experience/observation (leading to improved performance or decision-making)
  Here the focus is more explicitly on the mechanisms by which the information is encoded and used
Learning Topics
Topics
• History (beginning of the 20th century)
• Associative learning
  • What it explains
  • What it doesn't
  • Can S-R and representational accounts be reconciled?
• General-process vs. specialized learning mechanisms
• Active learning
• Learning, adaptation and evolution (Thursday)
• Natural and artificial systems (Next week)
A bit of history…
Learning and Behaviorism
• Behaviorism is both a philosophy and a method
• Behaviorism focuses on the observable relationship between experience (input) and behavior (output)
• Most behavior develops through associative learning (Pavlovian and operant conditioning)

Radical Behaviorism (Skinner)
• Only behavior (and antecedent causes in the environment) can be studied objectively; hypotheses about internal events are unscientific
• By extension, hypotheses about innate processes are unscientific (because antecedent causes prior to the birth of the animal can't be observed)
• Radical Behaviorism derives from Empiricism
A little more history…
The (slow) demise of Radical Behaviorism
• Chomsky: human language can't be accounted for by Behaviorist/Empiricist theories based only on the stimulus, response, and reinforcement experienced by the learner; we need to posit internal (innate?) grammatical rules that generate novel, never-experienced linguistic structures/knowledge
• AI (Newell and Simon)
• Ethology/Neurobiology: vindication of the "Nativist" view of behavioral development
Associative Learning

Definition: behavior changes as a result of experiencing an association between two events (E1 and E2)

Classical (Pavlovian) conditioning:
• E2 (intrinsically meaningful stimulus) leads to a reflexive response
• E1 (arbitrary stimulus) comes to trigger the response if experienced prior to E2

  E1     E2     Response
  Bell   Food   Salivation (unconditioned)
  Bell          Salivation (conditioned)

Instrumental/Operant/Trial-and-error conditioning:
• E2: an intrinsically meaningful stimulus (e.g. food or pain)
• E1: initially arbitrary action, strengthened or weakened when associated with E2

  E1 (action)   E2 (reinforcement)
  Press bar ->  Food

Operant conditioning exemplifies Thorndike's "Law of Effect"
Traditional Learning Theory (through the 1960s)
• Focus was on discovering general "laws of learning"
  • Order in which E1 and E2 must occur
  • Effects of salience
  • Effects of time delay
  • Effects of combining stimuli
  • Extinction of responses
• Avoided speculation about underlying mechanisms
• But the implication was that there were certain "general processes" involving strengthening of S-R connections
  • Complex behavior entailed chaining of S-R associations
  • "Knowledge" consists of chains of S-S associations
  • "Mind" is a huge look-up table
Transitions in Learning Theory
• Associative learning is not all-encompassing
  • Language
  • "Imitative" learning (e.g., bird song)
  • "Latent" learning (e.g., spatial exploration in the absence of reward)
• Even where associative learning operates, the General Process assumption crumbled
  • Different "laws" for different learned behaviors, e.g., the associability of different events varies according to the problem
  • Species differences in what can be learned and when
  • Dissociations in human and animal learning
Specializations: example from Animal Learning
Meadow voles: males are "polygynous" and have larger home ranges than females
Prairie voles: monogamous, and male and female home ranges are similar

In meadow voles, male spatial learning is better than that of females
In prairie voles, spatial learning is similar in the two sexes
Specializations: Ideas from Cognitive Psychology
Distinctions proposed for different systems of learning/memory
• Short-term vs. Long-term
• Explicit vs. Implicit (roughly, conscious vs. unconscious)
• Declarative ("knowing that") vs. Procedural ("knowing how")
• Instance-based vs. Rule-based
• Episodic vs. Semantic (re autobiographical knowledge of the past)
Episodic Memory in Animals?
Do scrub jays have episodic memory?
Clayton & Dickinson 1998
Where did I put that worm? And when did I put it there?
Endel Tulving (2001): Episodic memory entails conscious recall of autobiographical information, and the only evidence we have of conscious recall is a subject's verbal report
Associative Learning and the“Representational Theory of Mind”
S-R and Cognitive (representational) accounts are often pitted against one another. Can they be reconciled?

• Associative learning may be involved in the formation of complex representations (elements of experience bound together because of spatial-temporal contiguity)
• An internal representation of an important aspect of experience may function as input to an associative learning mechanism
Active Learning of Landmarks
Definition of landmark: any feature reliably associated with a goal
What makes a good landmark?
• For long-distance guidance: distant, low motion parallax (e.g., sun, distant mountains)
• For pinpointing a location: nearby, high motion parallax

Many animals, both vertebrate and invertebrate, have been shown to prefer nearby landmarks to learn locations: how do they figure out which ones are nearby?

How to identify a landmark (a big concern in robotics)?
• Static cues: color, contrast, symmetry, persistence over time
• Motion cues (to identify nearby features: "Turn Back and Look," TBL)
• Association with context or goals
Active Vision and Landmark Learning: Segmenting the scene into near and far elements
Flying insects actively examine the scene around a goal

Voss, R. (1995) Information durch Eigenbewegung: Rekonstruktion und Analyse des Bildflusses am Auge fliegender Insekten [Information from self-motion: reconstruction and analysis of image flow at the eye of flying insects]. Doctoral thesis, Universität Tübingen.

The flight path generates motion signals that may allow the insect to pick out nearby (and hence useful) landmarks against the background
A View From The Wasp’s “Cockpit”
[Figure: reconstructed scene and motion in the scene along the bee's path as she departs the food]

Eckert, M.P. & Zeil, J. (2001) Toward an ecology of motion vision. In: Motion Vision: Computational, Neural, and Ecological Constraints (eds. J.M. Zanker & J. Zeil). Springer Verlag, Berlin Heidelberg New York.
Computational Approaches to Learning
In Psychology:
• Motivated by the observation of "learning curves" showing a characteristic quantitative relationship between experience and performance
• Role of models:
  • Predict behavior/performance
  • Test hypotheses about mechanisms
Rescorla-Wagner
ΔV_CS = c (V_max − V_net)

• ΔV_CS: change in associative strength of the CS
• c: learning rate (a function of the salience or learnability of the US and CS)
• V_max: maximum associative strength (a function of the strength of the UR, or time lags between CS and US)
• V_net: current associative strength
• (V_max − V_net): measures the "surprisingness" of the event
Computational Approaches to Learning-cont’d
Rescorla-Wagner explains a lot
• Shapes of learning curves
• Extinction (waning of associative strength following non-rewarded trials)
• Responses to compound stimuli
  • Overshadowing:
    Train: [AX] -> US
    Test: X alone (weakened response to X by itself, which isn't as predictive as AX)
  • Blocking:
    Pretrain: A -> US
    Train: [AX] -> US
    Test: X alone (no response to X: it adds no predictive value)
• Many other things
• Integrating associative and representational approaches
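Blocking falls directly out of the Rescorla-Wagner rule, because every stimulus present on a trial shares one prediction error. A minimal sketch (stimulus names, learning rate, and trial counts are invented for illustration):

```python
def rw_trial(V, present, v_max, c=0.3):
    """One Rescorla-Wagner trial: every CS present shares the same
    prediction error c * (v_max - v_net), so stimuli compete for
    associative strength."""
    v_net = sum(V[s] for s in present)        # combined prediction of the US
    for s in present:
        V[s] += c * (v_max - v_net)           # delta-V_CS = c (V_max - V_net)
    return V

# Blocking: pretrain A alone, then train the compound [A, X].
V = {"A": 0.0, "X": 0.0}
for _ in range(50):
    rw_trial(V, ["A"], v_max=1.0)             # A alone comes to predict the US
for _ in range(50):
    rw_trial(V, ["A", "X"], v_max=1.0)        # compound trials: error is ~0
print(round(V["A"], 2), round(V["X"], 2))     # A near 1.0, X blocked near 0.0
```

Because A already predicts the US perfectly after pretraining, the error term is near zero during compound training, so X never gains strength.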
Computational Approaches to Learning-cont’d
Rescorla-Wagner also fails to explain some things
• Recovery from extinction (spontaneous or stimulus-triggered)
• Second-order conditioning:
  Train: CS1 -> US
  Train: CS2 -> CS1
  Rescorla-Wagner can't explain this, because CS2 never gets paired with the original reinforcer
Computational Approaches to Learning
In Computer Science/AI:
• Learning is considered a good solution to certain problems
  • Classifying complex patterns
  • Making predictions given uncertainty in the environment
  • Conferring autonomy on devices
• Making machines that learn entails a clear specification of the problem
  • State space
  • Action space
  • Performance metric for evaluating different responses ("utility function")
• Goal: compute an optimal "policy" (mapping from state to action)
Reinforcement Learning
Defines a class of problems:
• The agent learns from its own experience in the environment, rather than from supervised teaching
• High degree of uncertainty (resulting from environmental unpredictability)
• The animal improves based upon receipt of "rewards," but may not be rewarded until a sequence of actions has been completed
  • Thus, there is a problem of assigning credit to actions that have worked
• There is also a problem of generalizing to new situations
• There is also a problem of learning about features of the environment not yet experienced
A Grid-World (Spatial Cognition) Example
[Figure: a 4×3 grid world (columns 1-4, rows 1-3) containing an obstacle, two terminal states with payoffs +1 and -1, and an immediate payoff of -0.04 in every other state. Two sample trajectories are shown.]

Total payoff: Trial 1: -1.16; Trial 2: +0.80
While wandering through the environment, the agent may experience total payoffs that are very bad or very good.

• Numbers show the immediate payoff in each state.
• Arrows show an "optimal policy," i.e., one that will give the maximum total payoff if followed from each state.
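That maximum-total-payoff policy can be computed by standard value iteration. A minimal sketch, assuming deterministic moves and the usual placement of the obstacle at (2,2) and the +1/-1 terminals at (4,3)/(4,2); these coordinates are my reading of the figure, not stated in the text:

```python
# Value iteration on the 4x3 grid world described above.
# Assumptions: deterministic moves, obstacle at (2,2),
# terminals +1 at (4,3) and -1 at (4,2), payoff -0.04 elsewhere.
GRID = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
TERMINAL = {(4, 3): 1.0, (4, 2): -1.0}
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def step(s, a):
    """Deterministic transition; bumping into a wall or the obstacle
    leaves the agent where it is."""
    s2 = (s[0] + a[0], s[1] + a[1])
    return s2 if s2 in GRID else s

U = {s: 0.0 for s in GRID}
for _ in range(50):                            # sweep until values settle
    for s in GRID:
        if s in TERMINAL:
            U[s] = TERMINAL[s]
        else:
            U[s] = max(-0.04 + U[step(s, a)] for a in MOVES)

print(round(U[(1, 1)], 3))                     # five -0.04 steps, then +1
```

Under these assumptions the start state's value comes out at 0.8, matching the slide's best observed trial payoff of +0.80, and acting greedily with respect to U reproduces the arrowed optimal policy.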
Temporal Difference Methods
Definition: a set of algorithms whereby an agent can be programmed to predict, in each state, the reward received in the next state (or upon taking a given action)
• TD methods are closely related to the Rescorla-Wagner model
• They can deal with second-order conditioning, whereby a state gains value by an indirect (sequential) association with a reward state
• Goal: learn the true value of each state, assuming the optimal policy is followed
Temporal Difference Methods
The objective is to learn an estimate of the utility of all states. The utility is the expected sum of rewards from this state on.

Key idea: use an insight from dynamic programming to adjust the utility of a state based on the immediate reward and the utility of the next state:

U_{t+1}(s) ← U_t(s) + α (r_t(s) + γ U_t(s′) − U_t(s))

where α is the learning rate, r_t(s) is the reward obtained in state s, and s′ is the observed successor state.

U(s) is an estimate of V*(s), the maximum discounted cumulative reward starting in state s.
Temporal Difference Methods
U_{t+1}(s) ← U_t(s) + α (r_t(s) + γ U_t(s′) − U_t(s))

The bracketed term (r_t(s) + γ U_t(s′) − U_t(s)) represents the "TD error signal."

Example: updating the value of state 34 over three trials, assuming initial U_0 = 0, α = 0.9, γ = 0.9:

U_1 = 0 + 0.9(−0.04 + 0.9(1.0) − 0) = 0.774
  [Then state 34 can be used to update adjacent states (33 and 23)]
U_2 = 0.774 + 0.9(−0.04 + 0.9(1.0) − 0.774) = 0.851
U_3 = 0.851 + 0.9(−0.04 + 0.9(1.0) − 0.851) = 0.859

As the value/utility function is learned, a policy can be developed: for example, from each state, move to the state which has the highest value.
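The three updates in the worked example can be reproduced directly from the rule. A minimal sketch, treating the successor's utility as fixed at 1.0 (the +1 terminal):

```python
def td_update(u_s, r, u_next, alpha=0.9, gamma=0.9):
    """TD(0) update: U(s) <- U(s) + alpha * (r + gamma * U(s') - U(s)).
    The bracketed term is the TD error signal."""
    return u_s + alpha * (r + gamma * u_next - u_s)

# The slide's numbers: U0 = 0, step reward -0.04, successor utility 1.0.
u = 0.0
for trial in range(3):
    u = td_update(u, r=-0.04, u_next=1.0)
    print(round(u, 3))            # 0.774, then 0.851, then 0.859
# u converges toward the true value -0.04 + 0.9 * 1.0 = 0.86
```

Each pass shrinks the TD error by a factor of (1 − α), which is why the estimates close in on 0.86 so quickly.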
Practical Applications of TD methods
• Robotic navigation/motor control
• Elevator scheduling
• Backgammon (TD-Gammon)

...But does it have any relevance to biology?
• As a model of prediction-learning in n-armed bandit problems
• Otherwise it has limitations
Hebbian Learning
An account of learning at the cellular level: if two neurons are connected and active at the same time, some change occurs to strengthen the connection.

Proposed by Donald O. Hebb as early as 1949.

This provides a possible mechanism by which a given outcome could be "predicted" upon the occurrence of a given input.

But what kind of system could differentiate among predictors?
[Figure: a Hebbian network with "Sound" and "Smell" inputs; source: http://www.qub.ac.uk/mgt/intsys/nnbiol.html]
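A toy version of Hebb's rule using the figure's sound and smell inputs. The multiplicative form Δw = η · pre · post is an assumed textbook rendering, not taken from the slide:

```python
def hebb_step(w, pre, post, eta=0.1):
    """Hebb's rule: strengthen a connection whenever the pre- and
    postsynaptic units are active together."""
    return w + eta * pre * post

w_sound, w_smell = 0.0, 0.0
for _ in range(10):
    # The sound input co-occurs with postsynaptic firing; smell does not.
    w_sound = hebb_step(w_sound, pre=1.0, post=1.0)
    w_smell = hebb_step(w_smell, pre=0.0, post=1.0)
print(w_sound > w_smell)   # the co-active (sound) synapse has strengthened
```

Note that the rule strengthens any co-active connection indiscriminately, which is exactly why the slide asks what kind of system could differentiate among predictors.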
Error-Prediction (bee model)
[Figure: a network model of bee choice. Visual input from blue (B) and yellow (Y) flowers, weighted by w^B and w^Y, generates a prediction that is compared with the nectar reward r(t) to produce an error signal δ(t), which drives the action (choose Y or B). Source: http://psycserv.mcmaster.ca/~smitha/PAGE_RESEARCH/summary.html]

δ(t) = γ(r_t + V_t) − (r_{t−1} + V_{t−1})
Δw_t = α x_t δ_t
V_t = w_t · x_t = w_t^B x_t^B + w_t^Y x_t^Y
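A single-visit evaluation of the error signal above can make the quantities concrete. All numbers here (rewards, weights, γ, α) are invented for illustration:

```python
gamma, alpha = 0.9, 0.1
r_t, r_prev = 1.0, 0.0          # nectar found on this visit, none before
x_B, x_Y = 1.0, 0.0             # a blue flower is in view
w_B, w_Y = 0.2, 0.5             # current weights
V_t = w_B * x_B + w_Y * x_Y     # V(t) = w_B x_B + w_Y x_Y = 0.2
V_prev = 0.5                    # value predicted on the previous step
delta = gamma * (r_t + V_t) - (r_prev + V_prev)   # delta(t): positive surprise
dw_B = alpha * x_B * delta      # only the active (blue) weight changes
print(round(delta, 3), round(dw_B, 3))
```

Because the reward exceeded the previous prediction, δ is positive (0.58 with these numbers) and the weight on the active blue input grows, while x_Y = 0 leaves the yellow weight untouched.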
Limitations of TD learning as model for animals
• The state space has to be learned along with the policy, and learning of the state space is not part of the TD approach
• Some success so far, however, for models of n-armed bandit problems, where the state space maps neatly onto perceptual categories
  • These have the potential to scale up to deal with second-order conditioning
• For more complex problems (when the state space is large), theoretical TD algorithms take a long time to converge on the optimum
• In general, animals seem to be faster at solving the "credit-assignment problem"
Credit assignment:figuring out what to learn and what to ignore
A common learning problem is to predict the occurrence of an event, so as to prepare for it physiologically or behaviorally.

In a complex environment, many possible cues or behaviors might be correlated with a given outcome. The problem is to "assign credit" to the right antecedent event, so that reliable predictions can be made in the future.

• TD methods build up credit assignment incrementally, through learned linkages between states
• Is this what animals do?
Active learning and the credit-assignment problem
Insects "turn on" learning of landmarks when the landmarks are guaranteed to be useful

Orientation flights at food ("Turn Back and Look"):
• Done on departure, after receiving a reward
• Guarantees learning of landmarks that will be useful on return
• The learning flight is modulated according to the need for information (Cindy Wei)

Orientation flights at the nest (E. Capaldi):
• Young bees: first flights are learning flights
• After moving to a new nest: ditto
[Figure: departure durations (sec) plotted against visit number (1-18), showing actual departure durations and a nonlinear regression; a 9-min delay separates the initial phase from the post-delay phase]
Contextual cues as another solution to the credit-assignment problem

Landmarks along a route: "gated" by path-integration (PI) info?
• To learn landmarks associated with the homeward path, "turn on" learning only when the path integrator indicates you are heading homeward
• Ditto for outbound landmarks when heading to food

Schatz, B., Chameron, S., Beugnon, G. & Collett, T.S. (1999) The use of path integration to guide route learning in ants. Nature 399, 769-772.

• Ants head to food on a straight path
• They are required to find their way home along a hairpin path, with a series of choice points
• They correctly learn those decisions encountered when aligned with the home vector
[Figure: cumulative no. correct choices plotted against cumulative no. choices]