Neural Networks

Page 1:

Neural Networks

Page 2:

Functions

Input     Output
4, 4      8
2, 3      5
1, 9      10
6, 7      13
341, 257  598

Page 3:

Functions

Input Output

rock rock

sing sing

alqz alqz

dark dark

lamb lamb

Page 4:

Functions

Input  Output
0 0    0
1 0    0
0 1    0
1 1    1

Page 5:

Functions

Input Output

look looked

rake raked

sing sang

go went

want wanted

Page 6:

Functions

Input                       Output
John left                   1
Wallace fed Gromit          1
Fed Wallace Gromit          0
Who do you like Mary and?   0

Page 7:

Learning Functions

• In training, the network is shown examples of what the function generates and has to figure out what the function is.

• Think of language/grammar as a very big function (or set of functions). The learning task is similar: the learner is presented with examples of what the function generates and has to figure out what the system is.

• Main question in language acquisition: what does the learner need to know in order to successfully figure out what this function is?

• Questions about Neural Networks
– How can a network represent a function?

– How can the network discover what this function is?
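The sketch below is not the tlearn simulator used in the labs; the logistic unit, random initialization, squared-error delta rule, and learning rate are illustrative assumptions. It shows the basic idea: a single unit is shown input/output pairs from the AND function and gradually adjusts its weights until it reproduces the function.

import math, random

data = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1)]  # the AND function

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
w1, w2, bias = [random.uniform(-0.5, 0.5) for _ in range(3)]  # small random start
lr = 0.5  # learning rate (discussed on a later slide)

for sweep in range(10000):
    (i1, i2), target = random.choice(data)      # one example per training cycle
    out = sigmoid(w1 * i1 + w2 * i2 + bias)     # forward pass
    delta = (target - out) * out * (1 - out)    # error signal scaled by the sigmoid slope
    w1 += lr * delta * i1                       # nudge each weight in the direction
    w2 += lr * delta * i2                       # that reduces the error on this example
    bias += lr * delta

for (i1, i2), target in data:
    out = sigmoid(w1 * i1 + w2 * i2 + bias)
    print(i1, i2, "->", round(out, 2), "target:", target)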

Page 8:

Input  Output
0 0    0
1 0    0
0 1    0
1 1    1

AND Network

Page 9:

NETWORK CONFIGURED BY TLEARN
# weights after 10000 sweeps
# WEIGHTS
# TO NODE 1
-1.9083807468   ## bias to 1
4.3717832565    ## i1 to 1
4.3582129478    ## i2 to 1
0.0000000000

OR Network

Input  Output
0 0    0
1 0    1
0 1    1
1 1    1
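As a quick check, the weights listed above can be plugged into a single logistic unit (the logistic activation is an assumption here, since the weight file does not state it); the outputs approximate the OR truth table.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# weights reported by tlearn after 10000 sweeps (from the slide above)
bias, w_i1, w_i2 = -1.9083807468, 4.3717832565, 4.3582129478

for i1 in (0, 1):
    for i2 in (0, 1):
        out = sigmoid(bias + w_i1 * i1 + w_i2 * i2)
        print(i1, i2, "->", round(out, 2))
# 0 0 -> 0.13; a single active input -> about 0.92; 1 1 -> about 1.0.
# For AND, the bias would need to outweigh either single input weight on
# its own, so that one active input alone cannot switch the unit on.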

Page 10:

2-layer XOR Network

• In order for the network to model the XOR function, we need activation of either of the inputs to turn the output node “on” – just as in the OR network. This was achieved easily by making the negative bias weight smaller in magnitude than the positive weight on either of the inputs.

However, in the XOR network we also want turning both inputs on to switch the output node “off”. Since turning both inputs on can only increase the total input to the output node, and the output node is switched “off” only when its total input goes down, this effect cannot be achieved.

• The XOR function is not linearly separable, and hence it cannot be represented by a two-layer network. This is a classic result in the theory of neural networks.

Page 11:

-3.0456776619   ## bias to 1
5.5165352821    ## i1 to 1
-5.7562727928   ## i2 to 1

-3.6789164543   ## bias to 2
-6.4448370934   ## i1 to 2
6.4957633018    ## i2 to 2

-4.4429202080   ## bias to output
9.0652370453    ## 1 to output
8.9045801163    ## 2 to output

XOR Network

Input  Output (hidden node 1)
0 0    0
1 0    1
0 1    0
1 1    0

Input  Output (hidden node 2)
0 0    0
1 0    0
0 1    1
1 1    0

Hidden 1  Hidden 2  Output
0         0         0
1         0         1
0         1         1
1         1         1

The mapping from the hidden units to the output is an OR network that never receives a [1 1] input.
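A sketch of this decomposition, again assuming the usual logistic activation: plugging the weights above into a two-hidden-unit network reproduces XOR, with hidden node 1 active only for [1 0], hidden node 2 active only for [0 1], and the output node ORing the two hidden nodes.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def xor_net(i1, i2):
    # weights copied from the slide above
    h1 = sigmoid(-3.0456776619 + 5.5165352821 * i1 - 5.7562727928 * i2)
    h2 = sigmoid(-3.6789164543 - 6.4448370934 * i1 + 6.4957633018 * i2)
    return sigmoid(-4.4429202080 + 9.0652370453 * h1 + 8.9045801163 * h2)

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, "->", round(xor_net(i1, i2), 2))
# 0 0 -> 0.02, 1 0 -> 0.98, 0 1 -> 0.98, 1 1 -> 0.02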

Page 12:

Learning Rate

• The learning rate, which is explained in chapter 1 (pp. 12-13), is a training parameter that determines how strongly the network responds to an error signal at each training cycle. The higher the learning rate, the bigger the change the network will make in response to a large error. Sometimes a high learning rate is beneficial; at other times it can be quite disastrous for the network. An example of sensitivity to learning rate can be found in the case of the XOR network discussed in chapter 4.

• Why should it be a bad thing to make big corrections in response to big errors? The reason for this is that the network is looking for the best general solution to mapping all of the input-output pairs, but the network normally adjusts weights in response to an individual input-output pair. Since the network has no knowledge of how representative any individual input-output pair is of the general trend in the training set, it would be rash for the network to respond too strongly to any individual error signal. By making many small responses to the error signals, the network learns a bit more slowly, but it is protected against being messed up by outliers in the data.
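A minimal illustration of the parameter itself (the numbers are made up): the learning rate simply scales how far a weight moves in response to a single error signal, so a large rate lets one unrepresentative example drag the weights a long way.

error_signal = 0.8       # a large error on one input/output pair
input_activation = 1.0

for lr in (0.1, 0.5, 2.0):
    delta_w = lr * error_signal * input_activation   # delta-rule-style weight change
    print("learning rate", lr, "-> weight change", delta_w)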

Page 13:

Momentum

Just as with learning rate, sometimes the learning algorithm can only find a good solution to a problem if the momentum training parameter is set to a specific value. What does this mean, and why should it make a difference?

If momentum is set to a high value, then the weight changes made by the network are very similar from one cycle to the next. If momentum is set to a low value, then the weight changes made by the network can be very different on adjacent cycles.

So what?

Page 14:

Momentum

In searching for the best available configuration to model the training data, the network has no ‘knowledge’ of what the best solution is, or even whether there is a particularly good solution at all. It therefore needs some efficient and reliable way of searching the range of possible weight-configurations for the best solution.

One thing that can be done is for the network to test whether any small changes to its current weight-configuration lead to improved performance. If so, then it can make that change. Then it can ask the same question in its new weight-configuration, and again modify the weights if there is a small change that leads to improvement. This is a fairly effective way for a blind search to proceed, but it has inherent dangers – the network might come across a weight-configuration which is better than all very similar configurations, but is not the best configuration of all. In this situation, the network finds that no small changes improve performance, and will therefore not modify its weights. It therefore ‘thinks’ that it has reached an optimal solution, but this conclusion is incorrect. This problem is known as getting stuck in a local maximum (or, in terms of error, a local minimum).

Page 15:
Page 16:

Momentum

Momentum can serve to help the network avoid local maxima, by controlling the ‘scale’ at which the search for a solution proceeds. If momentum is set high, then changes in the weight-configuration are very similar from one cycle to the next. A consequence of this is that early in training, when error levels are typically high, weight changes will be consistently large. Because weight changes are forced to be large, this can help the network avoid getting trapped in a local maximum.

A decision about the momentum value to be used for learning amounts to a hypothesis about the nature of the problem being learned, i.e., it is a form of innate knowledge, although not of the kind that we are accustomed to dealing with.
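A sketch of a standard momentum-augmented update rule (the gradient values below are invented for illustration): each cycle's weight change blends in the previous change, so with high momentum successive changes stay large and similar, and a single small or contrary error signal barely slows the search.

gradients = [0.9, 0.8, -0.1, 0.85, 0.7]   # error-driven changes on successive cycles
lr = 0.3

for momentum in (0.0, 0.9):
    delta_w, w = 0.0, 0.0
    print("momentum =", momentum)
    for g in gradients:
        delta_w = lr * g + momentum * delta_w   # current change carries over part of the last one
        w += delta_w
        print("  change %+.3f   weight %+.3f" % (delta_w, w))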

Page 17:
Page 18:

The Past Tense and Beyond

Page 19:

Classic Developmental Story

• Initial mastery of regular and irregular past tense forms

• Overregularization appears only later (e.g. goed, comed)

• ‘U-Shaped’ developmental pattern taken as evidence for learning of a morphological rule

V + [+past] --> stem + /d/

Page 20:

Rumelhart & McClelland 1986

Model learns to classify regulars and irregulars, based on sound similarity alone. Shows U-shaped developmental profile.

Page 21:

What is really at stake here?

• Abstraction

• Operations over variables

– Symbol manipulation

– Algebraic computation

• Learning based on input

– How do learners generalize beyond input?

y = 2x

Page 22:

What is not at stake here

• Feedback, negative evidence, etc.

Page 23:

Who has the most at stake here?

• Those who deny the need for rules/variables in language have the most to lose here…if the English past tense is hard, just wait until you get to the rest of natural language!

• …but if they are successful, they bring with them a simple and attractive learning theory, and mechanisms that can readily be grounded at the neural level

• However, if the advocates of rules/variables succeed here or elsewhere, they face the more difficult challenge at the neuroscientific level

Page 24:
Page 25:

Pinker

Ullman

Page 26:
Page 27:
Page 28:

1. Are regulars different?
2. Do regulars implicate operations over variables?

Neuropsychological Dissociations

Other Domains of Morphology

Beyond Sound Similarity

Regulars and Associative Memory

Page 29:

(Pinker & Ullman 2002)

Page 30:

Beyond Sound Similarity

• Zero-derived denominals are regular

– Soldiers ringed the city

– *Soldiers rang the city

– high-sticked, grandstanded, …

– *high-stuck, *grandstood, …

• Productive in adults & children

• Shows sensitivity to morphological structure: [[ stem N ] ø V ]-ed

• Provides good evidence that sound similarity is not everything

• But nothing prevents a model from using a richer similarity metric
– morphological structure (for ringed)

– semantic similarity (for low-lifes)

Page 31:

1. Are regulars different?
2. Do regulars implicate operations over variables?

Neuropsychological Dissociations

Other Domains of Morphology

Beyond Sound Similarity

Regulars and Associative Memory

Page 32:

Regulars & Associative Memory

• Regulars are productive, need not be stored

• Irregulars are not productive, must be stored

• But are regulars immune to effects of associative memory?
– frequency

– over-irregularization

• Pinker & Ullman:
– regulars may be stored

– but they can also be generated on-the-fly

– ‘race’ can determine which of the two routes wins

– some tasks more likely to show effects of stored regulars

Page 33:

Child vs. Adult Impairments

• Specific Language Impairment
– Early claims that regulars show greater impairment than irregulars are not confirmed

• Pinker & Ullman 2002b
– ‘The best explanation is that language-impaired people are indeed impaired with rules, […] but can memorize common regular forms.’

– Regulars show consistent frequency effects in SLI, not in controls.

– ‘This suggests that children growing up with a grammatical deficit are better at compensating for it via memorization than are adults who acquired their deficit later in life.’

Page 34:

1. Are regulars different?
2. Do regulars implicate operations over variables?

Neuropsychological Dissociations

Other Domains of Morphology

Beyond Sound Similarity

Regulars and Associative Memory

Page 35:

Neuropsychological Dissociations

• Ullman et al. 1997
– Alzheimer’s disease patients
  • Poor memory retrieval
  • Poor irregulars
  • Good regulars
– Parkinson’s disease patients
  • Impaired motor control, good memory
  • Good irregulars
  • Poor regulars

• Striking correlation involving laterality of effect

• Marslen-Wilson & Tyler 1997
– Normals
  • past tense primes stem
– 2 Broca’s Patients
  • irregulars prime stems
  • inhibition for regulars
– 1 patient with bilateral lesion
  • regulars prime stems
  • no priming for irregulars or semantic associates

Page 36:

Morphological Priming

• Lexical Decision Task
– CAT, TAC, BIR, LGU, DOG
– press ‘Yes’ if this is a word

• Priming
– facilitation in decision times when related word precedes target (relative to unrelated control)
– e.g., {dog, rug} - cat

• Marslen-Wilson & Tyler 1997
– Regular: {jumped, locked} - jump
– Irregular: {found, shows} - find
– Semantic: {swan, hay} - goose
– Sound: {gravy, sherry} - grave

Page 37:
Page 38:

Neuropsychological Dissociations

• Bird et al. 2003
– complain that arguments for selective difficulty with regulars are confounded with the phonological complexity of the word-endings

• Pinker & Ullman 2002
– weight of evidence still supports dissociation; Bird et al.’s materials contained additional confounds

Page 39:

Brain Imaging Studies

• Jaeger et al. 1996, Language
– PET study of past tense
– Task: generate past from stem
– Design: blocked conditions
– Result: different areas of activation for regulars and irregulars

• Is this evidence decisive?
– task demands very different
– difference could show up in network
– doesn’t implicate variables

• Münte et al. 1997
– ERP study of violations
– Task: sentence reading
– Design: mixed
– Result:
  • regulars: ~LAN
  • irregulars: ~N400

• Is this evidence decisive?
– allows possibility of comparison with other violations

Page 40:

Regular / Irregular / Nonce conditions (Jaeger et al. 1996)

Page 41:

1. Are regulars different?
2. Do regulars implicate operations over variables?

Neuropsychological Dissociations

Other Domains of Morphology

Beyond Sound Similarity

Regulars and Associative Memory

Page 42:
Page 43:

(Clahsen, 1999)

Page 44:

Low-Frequency Defaults

• German Plurals
– die Straße → die Straßen
  die Frau → die Frauen
– der Apfel → die Äpfel
  die Mutter → die Mütter
– das Auto → die Autos
  der Park → die Parks
  die Schmidts

• -s plural low frequency, used for loan-words, denominals, names, etc.

• Response
– frequency is not the critical factor in a system that focuses on similarity
– distribution in the similarity space is crucial
– similarity space with islands of reliability
  • network can learn islands
  • or network can learn to associate a form with the space between the islands

Page 45:

Similarity Space

Page 46:

Similarity Space

Page 47:

German Plurals

(Hahn & Nakisa 2000)

Page 48:

Arabic Broken Plural

• CvCC
– nafs → nufuus ‘soul’
– qidh → qidaah ‘arrow’

• CvvCv(v)C
– xaatam → xawaatim ‘signet ring’
– jaamuus → jawaamiis ‘buffalo’

• Sound Plural
– shuway?ir → shuway?ir-uun ‘poet (dim.)’
– kaatib → kaatib-uun ‘writing (participle)’
– hind → hind-aat ‘Hind (fem. name)’
– ramadaan → ramadaan-aat ‘Ramadan (month)’

Page 49:

• How far can a model generalize to novel forms?

– All novel forms that it can represent

– Only some of the novel forms that it can represent

• Velar fricative [x], e.g., Bach

– Could the Lab 2b model generate the past tense for Bach?

Page 50:
Page 51:

Hebrew Word Formation

• Roots

– lmd learning

– dbr talking

• Word patterns

– CiCeC limed ‘he learned’

– CiCeC diber ‘he talked’

– CaCaC lamad ‘he studied’

– CiCuC limud ‘study’

– hitCaCeC hitlamed ‘he taught himself’

Page 52:
Page 53:

• English phonemes absent from Hebrew

– j (as in jeep)

– ch (as in chair)

– th (as in thick) <-- features absent from Hebrew

– w (as in wide)

• Do speakers generalize the Obligatory Contour Principle (OCP) constraint effects?

– XXY < YXX

– jjr < rjj

Page 54:

• Root position vs. word position

– *jjr

– jajartem

– hijtajartem (hiCtaCaCtem)

Page 55:

Ratings derived from rankings for word-triples (1 = best, 3 = worst); scores were subtracted from 4, so higher ratings indicate better-formed words.

Page 56:
Page 57:

Abstraction

• Phonological categories, e.g., /ba/

– Treating different sounds as equivalent

– Failure to discriminate members of the same category

– Treating minimally different words as the same

– Efficient memory encoding

• Morphological concatenation, e.g., V + ed

– Productivity: generalization to novel words, novel sounds

– Frequency-insensitivity in memory encoding

– Association with other aspects of ‘procedural memory’

Page 58:

Gary Marcus

Page 59:

Generalization

• Training Items
– Input: 1 0 1 0  Output: 1 0 1 0

– Input: 0 1 0 0 Output: 0 1 0 0

– Input: 1 1 1 0 Output: 1 1 1 0

– Input: 0 0 0 0 Output: 0 0 0 0

• Test Item
– Input: 1 1 1 1  Output: ? ? ? ?

Page 60:

Generalization

• Training Items
– Input: 1 0 1 0  Output: 1 0 1 0

– Input: 0 1 0 0 Output: 0 1 0 0

– Input: 1 1 1 0 Output: 1 1 1 0

– Input: 0 0 0 0 Output: 0 0 0 0

• Test Item
– Input: 1 1 1 1  Output: ? ? ? ?

1 1 1 1 (Humans)

1 1 1 0 (Network)

Page 61:

Generalization

• Training Items
– Input: 1 0 1 0  Output: 1 0 1 0

– Input: 0 1 0 0 Output: 0 1 0 0

– Input: 1 1 1 0 Output: 1 1 1 0

– Input: 0 0 0 0 Output: 0 0 0 0

• Test Item
– Input: 1 1 1 1  Output: ? ? ? ?

• Generalization fails because learning is local

1 1 1 1 (Humans)

1 1 1 0 (Network)
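A sketch of why local learning produces the 1 1 1 0 answer. This is an illustrative setup, not the original simulation: four independent logistic output units, each trained with a delta rule only on the four training items. The rightmost unit never sees a target of 1, so it cannot extend the identity function to the novel fourth bit.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

train = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 0]]   # output = input

# weights[j][i] connects input i to output unit j; the last entry is the bias
weights = [[0.0] * 5 for _ in range(4)]
lr = 1.0

for sweep in range(5000):
    for item in train:
        for j in range(4):                       # each output unit learns on its own
            net = sum(w * x for w, x in zip(weights[j][:4], item)) + weights[j][4]
            out = sigmoid(net)
            delta = (item[j] - out) * out * (1 - out)
            for i in range(4):
                weights[j][i] += lr * delta * item[i]
            weights[j][4] += lr * delta

test = [1, 1, 1, 1]
outputs = []
for j in range(4):
    net = sum(w * x for w, x in zip(weights[j][:4], test)) + weights[j][4]
    outputs.append(round(sigmoid(net), 2))
print(outputs)   # close to 1 1 1 0: the last unit was only ever trained to output 0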

Page 62:

Generalization

• Training Items
– Input: 1 0 1 0  Output: 1 0 1 0

– Input: 0 1 0 0 Output: 0 1 0 0

– Input: 1 1 1 0 Output: 1 1 1 0

– Input: 0 0 0 0 Output: 0 0 0 0

• Test Item
– Input: 1 1 1 1  Output: ? ? ? ?

• Generalization succeeds because representations are shared

1 1 1 1 (Humans)

1 1 1 1 (Network)

Page 63:

Now another example…

Page 64:

Shared Representation

Copying 1:

Copying 2:

“The key to the representation of variables is whether all inputs in a class are represented by a single node.”

Page 65:

Generalization

• “In each domain in which there is generalization, it is an empirical question whether the generalization is restricted to items that closely resemble training items or whether the generalization can be freely extended to all novel items within some class.”

Page 66:

Syntax, Semantics, & Statistics

Page 67:
Page 68:
Page 69:
Page 70:

Starting Small Simulation

• How well does the network perform?

• How does it manage to learn?