Neural Networks

Page 1:

Neural Networks

Page 2:

Functions

Input     Output
4, 4      8
2, 3      5
1, 9      10
6, 7      13
341, 257  598

Page 3:

Functions

Input Output

rock rock

sing sing

alqz alqz

dark dark

lamb lamb

Page 4:

Functions

Input  Output
0 0    0
1 0    0
0 1    0
1 1    1

Page 5:

Functions

Input Output

look looked

rake raked

sing sang

go went

want wanted

Page 6:

Functions

Input                       Output
John left                   1
Wallace fed Gromit          1
Fed Wallace Gromit          0
Who do you like Mary and?   0

Page 7:

Learning Functions

• In training, the network is shown examples of what the function generates and has to figure out what the function is.

• Think of language/grammar as a very big function (or set of functions). The learning task is similar: the learner is presented with examples of what the function generates and has to figure out what the system is.

• Main question in language acquisition: what does the learner need to know in order to successfully figure out what this function is?

• Questions about Neural Networks
– How can a network represent a function?

– How can the network discover what this function is?
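The sketch below is not the tlearn simulator used in the labs; the logistic unit, random initialization, squared-error delta rule, and learning rate are illustrative assumptions. It shows the basic idea: a single unit is shown input/output pairs from the AND function and gradually adjusts its weights until it reproduces the function.

import math, random

data = [((0, 0), 0), ((1, 0), 0), ((0, 1), 0), ((1, 1), 1)]  # the AND function

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

random.seed(0)
w1, w2, bias = [random.uniform(-0.5, 0.5) for _ in range(3)]  # small random start
lr = 0.5  # learning rate (discussed on a later slide)

for sweep in range(10000):
    (i1, i2), target = random.choice(data)      # one example per training cycle
    out = sigmoid(w1 * i1 + w2 * i2 + bias)     # forward pass
    delta = (target - out) * out * (1 - out)    # error signal scaled by the sigmoid slope
    w1 += lr * delta * i1                       # nudge each weight in the direction
    w2 += lr * delta * i2                       # that reduces the error on this example
    bias += lr * delta

for (i1, i2), target in data:
    out = sigmoid(w1 * i1 + w2 * i2 + bias)
    print(i1, i2, "->", round(out, 2), "target:", target)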

Page 8:

Input  Output
0 0    0
1 0    0
0 1    0
1 1    1

AND Network

Page 9:

NETWORK CONFIGURED BY TLEARN
# weights after 10000 sweeps
# WEIGHTS
# TO NODE 1
-1.9083807468   ## bias to 1
4.3717832565    ## i1 to 1
4.3582129478    ## i2 to 1
0.0000000000

OR Network

Input  Output
0 0    0
1 0    1
0 1    1
1 1    1
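As a quick check, the weights listed above can be plugged into a single logistic unit (the logistic activation is an assumption here, since the weight file does not state it); the outputs approximate the OR truth table.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# weights reported by tlearn after 10000 sweeps (from the slide above)
bias, w_i1, w_i2 = -1.9083807468, 4.3717832565, 4.3582129478

for i1 in (0, 1):
    for i2 in (0, 1):
        out = sigmoid(bias + w_i1 * i1 + w_i2 * i2)
        print(i1, i2, "->", round(out, 2))
# 0 0 -> 0.13; a single active input -> about 0.92; 1 1 -> about 1.0.
# For AND, the bias would need to outweigh either single input weight on
# its own, so that one active input alone cannot switch the unit on.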

Page 10:

2-layer XOR Network

• In order for the network to model the XOR function, we need activation of either of the inputs to turn the output node “on” – just as in the OR network. This was achieved easily by making the negative bias weight smaller in magnitude than the positive weight on either of the inputs.

However, in the XOR network we also want turning both inputs on to switch the output node “off”. Since turning both inputs on can only increase the total input to the output node, and the output node is switched “off” only when its total input goes down, this effect cannot be achieved.

• The XOR function is not linearly separable, and hence it cannot be represented by a two-layer network. This is a classic result in the theory of neural networks.

Page 11:

-3.0456776619   ## bias to 1
5.5165352821    ## i1 to 1
-5.7562727928   ## i2 to 1

-3.6789164543   ## bias to 2
-6.4448370934   ## i1 to 2
6.4957633018    ## i2 to 2

-4.4429202080   ## bias to output
9.0652370453    ## 1 to output
8.9045801163    ## 2 to output

XOR Network

Input  Output (hidden node 1)
0 0    0
1 0    1
0 1    0
1 1    0

Input  Output (hidden node 2)
0 0    0
1 0    0
0 1    1
1 1    0

Hidden 1  Hidden 2  Output
0         0         0
1         0         1
0         1         1
1         1         1

The mapping from the hidden units to the output is an OR network that never receives a [1 1] input.
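A sketch of this decomposition, again assuming the usual logistic activation: plugging the weights above into a two-hidden-unit network reproduces XOR, with hidden node 1 active only for [1 0], hidden node 2 active only for [0 1], and the output node ORing the two hidden nodes.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def xor_net(i1, i2):
    # weights copied from the slide above
    h1 = sigmoid(-3.0456776619 + 5.5165352821 * i1 - 5.7562727928 * i2)
    h2 = sigmoid(-3.6789164543 - 6.4448370934 * i1 + 6.4957633018 * i2)
    return sigmoid(-4.4429202080 + 9.0652370453 * h1 + 8.9045801163 * h2)

for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, "->", round(xor_net(i1, i2), 2))
# 0 0 -> 0.02, 1 0 -> 0.98, 0 1 -> 0.98, 1 1 -> 0.02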

Page 12:

Learning Rate

• The learning rate, which is explained in chapter 1 (pp. 12-13), is a training parameter that determines how strongly the network responds to an error signal at each training cycle. The higher the learning rate, the bigger the change the network will make in response to a large error. Sometimes a high learning rate is beneficial; at other times it can be quite disastrous for the network. An example of sensitivity to learning rate can be found in the case of the XOR network discussed in chapter 4.

• Why should it be a bad thing to make big corrections in response to big errors? The reason for this is that the network is looking for the best general solution to mapping all of the input-output pairs, but the network normally adjusts weights in response to an individual input-output pair. Since the network has no knowledge of how representative any individual input-output pair is of the general trend in the training set, it would be rash for the network to respond too strongly to any individual error signal. By making many small responses to the error signals, the network learns a bit more slowly, but it is protected against being messed up by outliers in the data.
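A minimal illustration of the parameter itself (the numbers are made up): the learning rate simply scales how far a weight moves in response to a single error signal, so a large rate lets one unrepresentative example drag the weights a long way.

error_signal = 0.8       # a large error on one input/output pair
input_activation = 1.0

for lr in (0.1, 0.5, 2.0):
    delta_w = lr * error_signal * input_activation   # delta-rule-style weight change
    print("learning rate", lr, "-> weight change", delta_w)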

Page 13:

Momentum

Just as with learning rate, sometimes the learning algorithm can only find a good solution to a problem if the momentum training parameter is set to a specific value. What does this mean, and why should it make a difference?

If momentum is set to a high value, then the weight changes made by the network are very similar from one cycle to the next. If momentum is set to a low value, then the weight changes made by the network can be very different on adjacent cycles.

So what?

Page 14:

Momentum

In searching for the best available configuration to model the training data, the network has no ‘knowledge’ of what the best solution is, or even whether there is a particularly good solution at all. It therefore needs some efficient and reliable way of searching the range of possible weight-configurations for the best solution.

One thing that can be done is for the network to test whether any small changes to its current weight-configuration lead to improved performance. If so, then it can make that change. Then it can ask the same question in its new weight-configuration, and again modify the weights if there is a small change that leads to improvement. This is a fairly effective way for a blind search to proceed, but it has inherent dangers – the network might come across a weight-configuration which is better than all very similar configurations, but is not the best configuration of all. In this situation, the network finds that no small changes improve performance, and will therefore not modify its weights. It therefore ‘thinks’ that it has reached an optimal solution, but this conclusion is incorrect. This problem is known as getting stuck in a local maximum (or, in terms of error, a local minimum).

Page 15:
Page 16:

Momentum

Momentum can serve to help the network avoid local maxima, by controlling the ‘scale’ at which the search for a solution proceeds. If momentum is set high, then changes in the weight-configuration are very similar from one cycle to the next. A consequence of this is that early in training, when error levels are typically high, weight changes will be consistently large. Because weight changes are forced to be large, this can help the network avoid getting trapped in a local maximum.

A decision about the momentum value to be used for learning amounts to a hypothesis about the nature of the problem being learned, i.e., it is a form of innate knowledge, although not of the kind that we are accustomed to dealing with.
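A sketch of a standard momentum-augmented update rule (the gradient values below are invented for illustration): each cycle's weight change blends in the previous change, so with high momentum successive changes stay large and similar, and a single small or contrary error signal barely slows the search.

gradients = [0.9, 0.8, -0.1, 0.85, 0.7]   # error-driven changes on successive cycles
lr = 0.3

for momentum in (0.0, 0.9):
    delta_w, w = 0.0, 0.0
    print("momentum =", momentum)
    for g in gradients:
        delta_w = lr * g + momentum * delta_w   # current change carries over part of the last one
        w += delta_w
        print("  change %+.3f   weight %+.3f" % (delta_w, w))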

Page 17:
Page 18:

The Past Tense and Beyond

Page 19:

Classic Developmental Story

• Initial mastery of regular and irregular past tense forms

• Overregularization appears only later (e.g. goed, comed)

• ‘U-Shaped’ developmental pattern taken as evidence for learning of a morphological rule

V + [+past] --> stem + /d/

Page 20:

Rumelhart & McClelland 1986

Model learns to classify regulars and irregulars, based on sound similarity alone. Shows U-shaped developmental profile.

Page 21:

What is really at stake here?

• Abstraction

• Operations over variables

– Symbol manipulation

– Algebraic computation

• Learning based on input

– How do learners generalize beyond input?

y = 2x

Page 22:

What is not at stake here

• Feedback, negative evidence, etc.

Page 23:

Who has the most at stake here?

• Those who deny the need for rules/variables in language have the most to lose here…if the English past tense is hard, just wait until you get to the rest of natural language!

• …but if they are successful, they bring with them a simple and attractive learning theory, and mechanisms that can readily be grounded at the neural level

• However, if the advocates of rules/variables succeed here or elsewhere, they face the more difficult challenge at the neuroscientific level

Page 24:
Page 25:

Pinker

Ullman

Page 26:
Page 27:
Page 28:

1. Are regulars different?
2. Do regulars implicate operations over variables?

Neuropsychological Dissociations

Other Domains of Morphology

Beyond Sound Similarity

Regulars and Associative Memory

Page 29:

(Pinker & Ullman 2002)

Page 30:

Beyond Sound Similarity

• Zero-derived denominals are regular

– Soldiers ringed the city

– *Soldiers rang the city

– high-sticked, grandstanded, …

– *high-stuck, *grandstood, …

• Productive in adults & children

• Shows sensitivity to morphological structure: [[ stem N ] ø V ]-ed

• Provides good evidence that sound similarity is not everything

• But nothing prevents a model from using a richer similarity metric
– morphological structure (for ringed)

– semantic similarity (for low-lifes)

Page 31:

1. Are regulars different?
2. Do regulars implicate operations over variables?

Neuropsychological Dissociations

Other Domains of Morphology

Beyond Sound Similarity

Regulars and Associative Memory

Page 32:

Regulars & Associative Memory

• Regulars are productive, need not be stored

• Irregulars are not productive, must be stored

• But are regulars immune to effects of associative memory?
– frequency

– over-irregularization

• Pinker & Ullman:
– regulars may be stored

– but they can also be generated on-the-fly

– ‘race’ can determine which of the two routes wins

– some tasks more likely to show effects of stored regulars

Page 33:

Child vs. Adult Impairments

• Specific Language Impairment
– Early claims that regulars show greater impairment than irregulars are not confirmed

• Pinker & Ullman 2002b
– ‘The best explanation is that language-impaired people are indeed impaired with rules, […] but can memorize common regular forms.’

– Regulars show consistent frequency effects in SLI, not in controls.

– ‘This suggests that children growing up with a grammatical deficit are better at compensating for it via memorization than are adults who acquired their deficit later in life.’

Page 34:

1. Are regulars different?
2. Do regulars implicate operations over variables?

Neuropsychological Dissociations

Other Domains of Morphology

Beyond Sound Similarity

Regulars and Associative Memory

Page 35:

Neuropsychological Dissociations

• Ullman et al. 1997
– Alzheimer’s disease patients
  • Poor memory retrieval
  • Poor irregulars
  • Good regulars
– Parkinson’s disease patients
  • Impaired motor control, good memory
  • Good irregulars
  • Poor regulars

• Striking correlation involving laterality of effect

• Marslen-Wilson & Tyler 1997
– Normals
  • past tense primes stem
– 2 Broca’s Patients
  • irregulars prime stems
  • inhibition for regulars
– 1 patient with bilateral lesion
  • regulars prime stems
  • no priming for irregulars or semantic associates

Page 36:

Morphological Priming

• Lexical Decision Task
– CAT, TAC, BIR, LGU, DOG
– press ‘Yes’ if this is a word

• Priming
– facilitation in decision times when related word precedes target (relative to unrelated control)
– e.g., {dog, rug} - cat

• Marslen-Wilson & Tyler 1997
– Regular: {jumped, locked} - jump
– Irregular: {found, shows} - find
– Semantic: {swan, hay} - goose
– Sound: {gravy, sherry} - grave

Page 37:
Page 38:

Neuropsychological Dissociations

• Bird et al. 2003
– complain that arguments for selective difficulty with regulars are confounded with the phonological complexity of the word-endings

• Pinker & Ullman 2002
– weight of evidence still supports dissociation; Bird et al.’s materials contained additional confounds

Page 39:

Brain Imaging Studies

• Jaeger et al. 1996, Language
– PET study of past tense
– Task: generate past from stem
– Design: blocked conditions
– Result: different areas of activation for regulars and irregulars

• Is this evidence decisive?
– task demands very different
– difference could show up in network
– doesn’t implicate variables

• Münte et al. 1997
– ERP study of violations
– Task: sentence reading
– Design: mixed
– Result:
  • regulars: ~LAN
  • irregulars: ~N400

• Is this evidence decisive?
– allows possibility of comparison with other violations

Page 40:

Regular / Irregular / Nonce conditions (Jaeger et al. 1996)

Page 41:

1. Are regulars different?
2. Do regulars implicate operations over variables?

Neuropsychological Dissociations

Other Domains of Morphology

Beyond Sound Similarity

Regulars and Associative Memory

Page 42:
Page 43:

(Clahsen, 1999)

Page 44:

Low-Frequency Defaults

• German Plurals
– die Straße → die Straßen
  die Frau → die Frauen
– der Apfel → die Äpfel
  die Mutter → die Mütter
– das Auto → die Autos
  der Park → die Parks
  die Schmidts

• -s plural low frequency, used for loan-words, denominals, names, etc.

• Response
– frequency is not the critical factor in a system that focuses on similarity
– distribution in the similarity space is crucial
– similarity space with islands of reliability
  • network can learn islands
  • or network can learn to associate a form with the space between the islands

Page 45:

Similarity Space

Page 46:

Similarity Space

Page 47:

German Plurals

(Hahn & Nakisa 2000)

Page 48:

Arabic Broken Plural

• CvCC
– nafs → nufuus ‘soul’
– qidh → qidaah ‘arrow’

• CvvCv(v)C
– xaatam → xawaatim ‘signet ring’
– jaamuus → jawaamiis ‘buffalo’

• Sound Plural
– shuway?ir → shuway?ir-uun ‘poet (dim.)’
– kaatib → kaatib-uun ‘writing (participle)’
– hind → hind-aat ‘Hind (fem. name)’
– ramadaan → ramadaan-aat ‘Ramadan (month)’

Page 49:

• How far can a model generalize to novel forms?

– All novel forms that it can represent

– Only some of the novel forms that it can represent

• Velar fricative [x], e.g., Bach

– Could the Lab 2b model generate the past tense for Bach?

Page 50:
Page 51:

Hebrew Word Formation

• Roots

– lmd learning

– dbr talking

• Word patterns

– CiCeC limed ‘he learned’

– CiCeC diber ‘he talked’

– CaCaC lamad ‘he studied’

– CiCuC limud ‘study’

– hitCaCeC hitlamed ‘he taught himself’

Page 52:
Page 53:

• English phonemes absent from Hebrew

– j (as in jeep)

– ch (as in chair)

– th (as in thick) <-- features absent from Hebrew

– w (as in wide)

• Do speakers generalize the Obligatory Contour Principle (OCP) constraint effects?

– XXY < YXX

– jjr < rjj

Page 54:

• Root position vs. word position

– *jjr

– jajartem

– hijtajartem (hiCtaCaCtem)

Page 55:

Ratings derived from rankings for word-triples (1 = best, 3 = worst); scores were subtracted from 4, so higher ratings indicate better-formed words.

Page 56:
Page 57:

Abstraction

• Phonological categories, e.g., /ba/

– Treating different sounds as equivalent

– Failure to discriminate members of the same category

– Treating minimally different words as the same

– Efficient memory encoding

• Morphological concatenation, e.g., V + ed

– Productivity: generalization to novel words, novel sounds

– Frequency-insensitivity in memory encoding

– Association with other aspects of ‘procedural memory’

Page 58:

Gary Marcus

Page 59:

Generalization

• Training Items
– Input: 1 0 1 0  Output: 1 0 1 0

– Input: 0 1 0 0 Output: 0 1 0 0

– Input: 1 1 1 0 Output: 1 1 1 0

– Input: 0 0 0 0 Output: 0 0 0 0

• Test Item
– Input: 1 1 1 1  Output: ? ? ? ?

Page 60:

Generalization

• Training Items
– Input: 1 0 1 0  Output: 1 0 1 0

– Input: 0 1 0 0 Output: 0 1 0 0

– Input: 1 1 1 0 Output: 1 1 1 0

– Input: 0 0 0 0 Output: 0 0 0 0

• Test Item
– Input: 1 1 1 1  Output: ? ? ? ?

1 1 1 1 (Humans)

1 1 1 0 (Network)

Page 61:

Generalization

• Training Items
– Input: 1 0 1 0  Output: 1 0 1 0

– Input: 0 1 0 0 Output: 0 1 0 0

– Input: 1 1 1 0 Output: 1 1 1 0

– Input: 0 0 0 0 Output: 0 0 0 0

• Test Item
– Input: 1 1 1 1  Output: ? ? ? ?

• Generalization fails because learning is local

1 1 1 1 (Humans)

1 1 1 0 (Network)
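A sketch of why local learning produces the 1 1 1 0 answer. This is an illustrative setup, not the original simulation: four independent logistic output units, each trained with a delta rule only on the four training items. The rightmost unit never sees a target of 1, so it cannot extend the identity function to the novel fourth bit.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

train = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 0]]   # output = input

# weights[j][i] connects input i to output unit j; the last entry is the bias
weights = [[0.0] * 5 for _ in range(4)]
lr = 1.0

for sweep in range(5000):
    for item in train:
        for j in range(4):                       # each output unit learns on its own
            net = sum(w * x for w, x in zip(weights[j][:4], item)) + weights[j][4]
            out = sigmoid(net)
            delta = (item[j] - out) * out * (1 - out)
            for i in range(4):
                weights[j][i] += lr * delta * item[i]
            weights[j][4] += lr * delta

test = [1, 1, 1, 1]
outputs = []
for j in range(4):
    net = sum(w * x for w, x in zip(weights[j][:4], test)) + weights[j][4]
    outputs.append(round(sigmoid(net), 2))
print(outputs)   # close to 1 1 1 0: the last unit was only ever trained to output 0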

Page 62:

Generalization

• Training Items
– Input: 1 0 1 0  Output: 1 0 1 0

– Input: 0 1 0 0 Output: 0 1 0 0

– Input: 1 1 1 0 Output: 1 1 1 0

– Input: 0 0 0 0 Output: 0 0 0 0

• Test Item
– Input: 1 1 1 1  Output: ? ? ? ?

• Generalization succeeds because representations are shared

1 1 1 1 (Humans)

1 1 1 1 (Network)

Page 63:

Now another example…

Page 64:

Shared Representation

Copying 1:

Copying 2:

“The key to the representation of variables is whether all inputs in a class are represented by a single node.”

Page 65:

Generalization

• “In each domain in which there is generalization, it is an empirical question whether the generalization is restricted to items that closely resemble training items or whether the generalization can be freely extended to all novel items within some class.”

Page 66:

Syntax, Semantics, & Statistics

Page 67:
Page 68:
Page 69:
Page 70:

Starting Small Simulation

• How well does the network perform?

• How does it manage to learn?