Training the Model - The University of EdinburghAgain, the connectionist model directly predicts...

Cognitive ModelingLecture 13: Connectionist Networks: Models of

Language Processing

Frank KellerSchool of Informatics

University of Edinburgh

[email protected]

Cognitive Modeling: Connectionist Models of Language Processing – p.1

Overview

� Reading aloud: orthography-phonology mapping;

� models adult performance;

� good performance on known and unknown words;

� models (normal) human behavior;

� fails to replicate the double-dissociation (in dyslexia);

� importance of input and output representations.

Reading: McLeod et al. (1998, Ch. 8).


Reading Aloud

Task: produce the correct pronunciation for a word, given itsprinted form.

Suited to connectionist modeling:

� we need to learn mappings from one domain (print) toanother (sound);

� multi-layer networks are good at this, even whenmappings are somewhat arbitrary;

� human learning is similar to network learning: takes placegradually, over time; incorrect attempts later corrected.

If a network can’t model this linguistic task successfully, it wouldbe a serious blow to connectionist modeling.


Dual Route Model

Standard model: two independentroutes leading to pronunciation, be-cause:

� people can pronounce wordsthey have never seen: SLINT orMAVE;

� people can pronounce wordswhich break the rules: PINT orHAVE.

One mechanism uses general rules for pronunciation; the otherone stores pronunciation information with specific words.


[email protected]

Behavior of Dual-Route Models

Consider: KINT, MINT, and PINT

KINT is not a word:

� no entry in the lexicon;

� an only be pronounced using the rule-based mechanism.

MINT is a word:

� can be pronounced using the rule-based mechanism;

� but exists in the lexicon, so also can be pronounced by the lexicalroute.

PINT is a word, but irregular:

� Can only be correctly pronounced by the lexical route;

� Otherwise, it would rhyme with MINT.


Evidence for the Dual-Route Model

Double dissociation: neuropsychological evidence showsdifferential behavior for two types of brain damage:

� phonological dyslexia: symptom: words read without difficulty,but pronunciations for non-words can’t be produced; explanation:damage to rule-based route; lexical route intact;

� surface dyslexia: symptom: words and non-words producedcorrectly, but errors on irregulars (tendency to regularize);explanation: damage to lexical route; rule-based route intact.

All dual-route models share:

� a lexicon for known words, with specific pronunciationinformation;

� a rule mechanism for the pronunciation of unknown words.


Towards a Connectionist Model

It is unclear how a connectionist model could naturallyimplement a dual-route model:

� no obvious way to implement a lexicon to store informationabout particular words; storage is typically distributed;

� no clear way to distinguish specific information fromgeneral rules; only one uniform way to store information:connection weights.

Examine the behavior of a standard 2-layer feedforward model(Seidenberg and McClelland, 1989):

� trained to pronounce all the monosyllabic words ofEnglish;

� learning implemented using standard backpropagation.


Seidenberg and McClelland (1989)

2-layer feed-forward model:

� distributed representations at input andoutput;

� distributed knowledge within the net;� gradient descent learning.

Input and output:

� inputs are activated by the letters of the words: 20%activated, on average;

� outputs represent the phonological features: 12%activated, on average;

� encoding of features does not affect the success.

Processing: neti = ∑ j wi ja j +biasi; logistic activation function.


Training the ModelLearning:

� weights and bias are initially random;� words are presented, and outputs are computed;� connection weights are adjusted using backpropagation of error.

Training:

� all monosyllabic words of 3 or more letters (about 3000);� in each epoch, a sub-set was presented: frequent words

appeared more often;� over 250 epochs, THE was presented 230 times, least common

word 7 times.Performance:

� outputs were considered correct if the pattern was closer to thecorrect pronunciation than that of any word;

� after 250 epochs, accuracy was 97%.


Results

The model successfully learns to map most regular andirregular word forms to their correct pronunciation.

� it does this without separate routes for lexical andrule-based processing;

� there is no word-specific memory

It does not perform as well as human in pronouncingnon-words.

Naming Latency: experiments have shown that adult reactiontimes for naming a word is a function of variables such as wordfrequency and spelling regularity.


Results

The current model cannot directly mimic latencies, since thecomputation of outputs is constant.

The model can be seen as simulating this observation if werelate the output error score to latency.

� phonological error score is the difference between theactual pattern and the correct pattern;

� hypothesis: high error should correlate with longerlatencies.


Word Frequency Effects

Experimental finding: common words are pronounced morequickly than uncommon words.

Conventional (localist) explanation:

� frequent words require a lower threshold of activity for the“word recognition device” to “fire”;

� infrequent words require a higher threshold of activity.

In the S&M model, naming latency is modeled by the error:

� word frequency is reflected in the training procedure;� phonological error is reduced by training, and therefore

lower for high frequency words.

The explanation of latencies in terms of error follows directlyfrom the network s architecture and the training regime.


Frequency × Regularity

In addition to faster naming of frequent words, subjects exhibit:

� faster pronunciation of regular words (e.g., GAVE orMUST) than irregular words (e.g., HAVE or PINT);

� but this effect interacts with frequency: it is only observedwith low frequency words.

Regulars: small frequency effect; it takes slightly longer topronounce low frequency regulars.

Regulars: large frequency effect.

The model mimics this pattern of behavior in the error.

Dual route model: both lexical and rule outcome possible;lexical route wins for high frequency words (as it’s faster).


Frequency × Neighborhood Size

The neighborhood size of a word is defined as the number ofwords that differ by one letter.

Neighborhood size has an effect on naming latency:

� not much influence for high frequency words;� low frequency words with small neighborhoods are read

more slowly than words with large neighborhoods.

Shows cooperation of the information learned in response todifferent (but similar) inputs.

Again, the connectionist model directly predicts this.

Dual route model: ad hoc explanation, grouping across localistrepresentations of the lexicon.


Experimental Data vs. ModelFrequency × regularity:

Frequency × neighborhood size:

regular: filled circlesirregular: open squares

small neighborhood:filled circleslarge neighborhood:open squares


Spelling-to-Sound Consistency

Consistent spelling patterns: all words have the samepronunciation, e.g., _UST

Inconsistent patterns: more than one pronunciation, e.g., _AVE.

Observation: adult readers produce pronunciations morequickly for non-words derived from consistent patterns (NUST)than from inconsistent patterns (MAVE).

Dual route: difficult: both are pro-cessed by non-lexical route; wouldneed to distinguish consistent and in-consistent rules.

The error in the connectionist modelpredicts this latency effect.


Summary: Seidenberg and McClelland

What has the model achieved:

� single mechanism with no lexical entries or explicit rules;

� response to an input is a function of the network’s experience:previous experience on a particular word; experience with wordsresembling it.

Example: specific experience with HAVE is sufficient to overcome thegeneral information that _AVE is pronounced with a long vowel.

Network can produce a plausible pronunciation for MAVE; but error isintroduced by experience with inconsistent words like HAVE.

Performance: 97% accuracy on pronouncing learned words. Accountfor: interaction of frequency with regularity, neighborhood, consistency.

Limitations: not as good as humans at: (a) reading non-words (model60% correct, humans 90% correct); (b) lexical decision (FRAME is aword, but FRANE is not).


Representations are Important

Position specific representation: for inputting words ofmaximum length N: use N groups of 26 binary inputs.

But consider: LOG, GLAD, SPLIT, GRILL, CRAWL:

� model needs to learn correspondence between L and /l/;

� but L always appears in different positions;

� learning different pronunciations for different positionsshould be straightforward;

� alignment: letters and phonemes are not in 1-to-1correspondence.

Non-position-specific representation: loses important orderinformation: RAT = ART = TAR.


Representations are Important

Solution: S&M decompose word and phoneme strings intotriples of letters:

� FISH = _FI SH_ ISH FIS (note: _ is word boundary);

� each input is associated with 1000 random triples;

� active is that triple appears in the input word.

This representation is called Wickelfeatures.

S&M still suffer some specific effects: information learned abouta letter in one context is not easily generalized.


Improving the Model: Plaut et al. (1996)

Plaut et al.’s (1996) solution: non-position-specificrepresentation with linguistic constraints:

� monosyllabic word = onset + vowel + coda;

� strong constraints on order within these clusters, e.g., if tand s are together in the onset, s always precedes t;

� only one set of grapheme-to-phoneme units is required forthe letters in each group;

� correspondences can be pooled across different words,even when letters appear in different positions.


Improving the Model: Plaut et al. (1996)

Input representations:

� onset: first letter or consonant cluster (30): y s p t k q c b dg f v j z l m n r w h ch gh gn ph ps rh sh th ts wh;

� vowel (27): e I o u a y ai au aw ay ea ee ei eu ew ey ie oaoe oi oo ou ow oy ue ui uy q;

� coda: final letter or consonant cluster (48): h r l m n b d gcxf v j s z p t k q bb ch ck dd dg ff gg gh gn ks ll ng nn phpp ps rr sh sl ss tch th ts tt zz u e es ed.

Monosyllabic words are spelled by choosing one or morecandidates from each of the 3 possible groups:

Example: THROW: (‘th’ + ‘r’), (‘o’), (‘w’).


Output Representations

Just like the input representation, the output representation isalso subject to linguistic constraints:

Phonology: groups of mutually exclusive members:

� onset (23): s S C; z Z j f v T D p b t d k g m n h; l r w y q;

� vowel (14): a e i o u @ ∧ A E I O U W Y q;

� coda (24): r; s z; l; f v p k; m n N; t; b g d; S Z T D C j; psks ts.

Example: /scratch/ = s k r a _ _ _ _ _ _ _ _ C.


Network Architecture and Performance

The architecture of the Plaut et al. (1996) network:

� a total 105 possible orthographiconsets, vowels, and codas;

� 61 possible phonological onsets,vowels and codas.

Performance of the Plaut et al. (1996) model:

� succeeds in learning both regular and exception words;

� produces the frequency × regularity interaction;

� demonstrates the influences of frequency andneighborhood size.


Network Architecture and Performance

What is the performance on non-words?

� for consistent words (HEAN/DEAN): model (98%) versushuman (94%);

� for inconsistent words (HEAF/DEAF/LEAF): model (72%),human (78%);

� this reflects production of regular forms: both human andmodel produced both.

Highlights the importance of encoding and how muchknowledge is implicit in the coding scheme.


Discussion

Word frequencies:

� Seidenberg and McClelland (1989) presented trainingmaterials according to the log frequencies of words;

� people must deal with absolute frequencies which mightlead the model to see low frequency items too rarely;

� Plaut et al. (1996) model, however, succeeds withabsolute frequencies.

Doesn’t explain the double dissociation:

� models surface dyslexia (can read exceptions, but notnon-words);

� doesn’t model phonological dyslexia (can pronouncenon-words, but not irregulars).


Discussion

Representations:

� the right encoding scheme is essential for modeling thefindings: how much linguistic knowledge is given to thenetwork by Plaut et al.’s encoding?

� they assume this knowledge could be partially acquiredprior to reading, i.e., children learn to pronounce wordsbefore they can read them;

� doesn’t scale to polysyllabic words.


Summary

� Dissociations in performance do not necessarily entail distinctmechanisms;

� reading aloud model: a single mechanism explains regular andirregular pronunciation of monosyllabic words;

� but: explaining double dissociations is difficult (has been shownfor small networks, but unclear if larger, more plausible, networkscan demonstrate double dissociations);

� connectionist models excel at finding structure and patterns in theenvironment: ‘statistical inference machines’:� the start state for learning may be relatively simple,

unspecified;� necessary constraints to aid learning come from the

environment (or from the representations used).

� Can such models scale up? Are they successful for languageswith different distributional properties?


References

McLeod, Peter, Kim Plunkett, and Edmund T. Rolls. 1998. Introduction to Connectionist

Modelling of Cognitive Processes. Oxford: Oxford University Press.

Plaut, David C., James L. McClelland, Mark S. Seidenberg, and Karalyn Patterson. 1996.

Understanding normal and impaired word reading: Computational principles in

quasi-regular domains. Psychological Review 103: 56–115.

Seidenberg, Mark S., and James L. McClelland. 1989. A distributed developmental model of

word recognition and naming. Psychological Review 96: 523–568.


Training the Model - The University of EdinburghAgain, the connectionist model directly predicts...

Documents

Transcript of Training the Model - The University of EdinburghAgain, the connectionist model directly predicts...