A CONNECTIONIST SYMBOL MANIPULATOR THAT INDUCES REWRITE RULES IN CONTEXT-FREE GRAMMARS

Sreerupa Das & Michael C. Mozer*

Introduction. We describe a connectionist architecture that is able to learn to parse strings in a context-free grammar (CFG) from positive and negative examples. To illustrate, consider the grammar in Figure 1a. The grammar consists of terminal symbols (denoted by lower case letters), nonterminal symbols (denoted by upper case letters), and rules that specify how a symbol string can be reduced to a nonterminal, for example, ab reduced to S. The grammar can be used to parse strings like aabb into a tree structure as shown in Figure 1b. If a left-to-right parsing strategy is taken, then the steps in the reduction process for the string aabb are as shown in Figure 1c.

Our architecture attempts to learn explicit rewrite rules in a grammar of the form in Figure 1a, so as to be able to reduce (or correctly parse) positive examples as shown in Figure 1c. This involves the ability to iteratively substitute a single nonterminal in place of a string of symbols, that is, to reduce more than one symbol to one. Since this architecture takes a left-to-right parsing strategy, it is suitable for LR grammars. Any CFG can be classified as an LR(n) grammar, which means that strings can be parsed from left to right with n symbols of lookahead. In the present work, we examine only LR(0) grammars, although the architecture can be generalized to any n.


Figure 1: (a) The rewrite rules in a grammar for the language a^n b^n. (b) A parse tree for the string aabb. (c) The stages of left-to-right reduction. Rectangles around the symbols indicate the section of the string that has been exposed to the parser so far. In the context of the network, the rectangles denote the symbols that were either presented to the network at an earlier time step or were written on the scratch pad as a result of a prior reduction step.

Giles et al. (1990), Sun et al. (1990), and Das, Giles, and Sun (1992) have previously explored the learning of CFGs using neural network models. Their approach was based on the automaton perspective of a recognizer, where the primary interest was to learn the dynamics of a pushdown automaton.

*Department of Computer Science & Institute of Cognitive Science, University of Colorado, Boulder, CO 80309-0430, USA




Figure 2: The network architecture.

There has also been work in CFG inference using symbolic approaches (for example, Cook & Rosenfeld, 1974; Crespi-Reghizzi, 1971; Fass, 1983; Knobe & Knobe, 1976; Sakakibara, 1988). These approaches require a significant amount of prior information about the grammar and, although theoretically sound, have not proven very useful in practice.

Processing Mechanism. Once learning is complete, we envision a processing mechanism that has the following dynamics. An input string is copied into a linear scratch pad memory. The purpose of the scratch pad is to hold the transitional stages of the string during the reduction process. A set of demons looks through the scratch pad, each looking for a specific pattern of symbols. A read-head determines the location on the scratch pad memory where the demons should focus their attention. When a demon finds its pattern on the scratch pad, it fires, which causes the elements of its pattern to be replaced by a symbol associated with that demon. This action corresponds to the reduction of a string to a nonterminal symbol in accordance with a rule of the grammar. The read-head starts from the left end of the string in the scratch pad and makes a right shift when none of the demons fire. This process continues until the read-head has processed all symbols in the scratch pad and no demon can fire. The sequence of demon firings provides information about the hierarchical structure of the string. If the string has been reduced correctly, the final contents of the scratch pad will simply be the root symbol S, as illustrated in Figure 1c.

Architecture. The architecture consists of a two-layer network and a scratch pad memory (Figure 2). A set of demon units and a set of input units constitute the two layers in the network. Each demon unit is associated with a particular nonterminal. Several demons may be associated with the same nonterminal, leading to rewrite rules of a more general form, for example, X → ab | Yc. The read-head of the scratch pad memory is implemented by the input units. At a particular time step, the input units make two symbols from the scratch pad memory visible to the demon units. If a demon recognizes the ordered pair of symbols, it replaces the two symbols by the nonterminal symbol it represents. Since all CFGs can be formalized by rules that reduce two symbols to a nonterminal, presenting only two symbols to the demon units at a time places no restriction on the class of grammars that can be recognized. In our architecture, the scratch pad is implemented as a combination of a stack and an input queue, details of which will be discussed in a subsequent section. The architecture does require some prior knowledge about the grammars to be processed: the maximum number of rewrite rules and the maximum number of rules that have the same left-hand side need to be specified in advance. This information puts a lower bound on the number of demon units the network must have.
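As a concrete illustration of the discrete processing mechanism just described, the following sketch (our illustration, not code from the paper) simulates demons that reduce adjacent pairs of symbols on the scratch pad for the a^n b^n grammar of Figure 1a; the rule table and function name are assumptions made for exposition.

    # Sketch (ours): discrete demon-based reduction for the grammar of Figure 1a,
    # S -> ab | aX,  X -> Sb.  Each entry maps a pair of symbols to the
    # nonterminal that the corresponding demon writes back onto the scratch pad.
    RULES = {("a", "b"): "S",
             ("S", "b"): "X",
             ("a", "X"): "S"}

    def parse(string):
        stack, queue = [], list(string)      # scratch pad = stack + input queue
        while True:
            top2 = tuple(stack[-2:])
            if top2 in RULES:                # a demon fires: reduce two symbols to one
                stack[-2:] = [RULES[top2]]
            elif queue:                      # no demon fires: the read-head shifts right
                stack.append(queue.pop(0))
            else:
                break                        # nothing left to shift or reduce
        return stack == ["S"]                # accepted iff only the root symbol remains

    assert parse("aabb") and not parse("abb")

In this discrete limit the demon firings reproduce exactly the reduction stages of Figure 1c; the remainder of the paper replaces the hard firing decisions with continuous activities so that the system can be trained by gradient methods.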


Continuous Dynamics. So far, we have described the model in a discrete way: demon unit firing is all-or-none and mutually exclusive, corresponding to the demon units achieving a unary representation. This may be the desired behavior following learning, but neural net learning algorithms like back propagation require exploration in continuous state and weight spaces and therefore need to allow partial activity of demon units. The activation dynamics for the demon units have therefore been formulated as follows.

Demon unit i computes the distance between the input vector, x, and its weight vector, $w_i$: $d_i = b_i \sum_j (w_{ij} - x_j)^2$, where $b_i$ is an adjustable bias associated with the unit. The activity of demon unit i, denoted by $s_i$, is computed via a normalized exponential transform (Bridle, 1990; Rumelhart, in press)

$$ s_i = \frac{e^{-d_i}}{\sum_k e^{-d_k}}, $$

which enforces a competition among the units. A special unit, called the default unit, is designed to respond when none of the demon units fires strongly. Its activity, $s_{def}$, is computed like that of any other demon unit, with $d_{def} = b_{def}$. The activation of the default unit determines the amount of right shift to be made by the read-head.
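A minimal sketch of these activation dynamics (ours, not the authors' code), assuming the input x is the concatenation of the two composite symbol vectors under the read-head and that the default unit shares the same normalization:

    import numpy as np

    # Sketch (ours): continuous demon-unit dynamics.
    # x      : input vector seen by the demon units
    # W      : one weight vector per demon unit (one row per unit)
    # b      : adjustable biases b_i of the demon units
    # b_def  : bias of the default unit (d_def = b_def)
    def demon_activities(x, W, b, b_def):
        d = b * np.sum((W - x) ** 2, axis=1)          # d_i = b_i * sum_j (w_ij - x_j)^2
        d_all = np.append(d, b_def)                   # include the default unit
        s = np.exp(-d_all) / np.sum(np.exp(-d_all))   # normalized exponential transform
        return s[:-1], s[-1]                          # demon activities s_i and s_def

In the discrete limit, one $s_i$ approaches 1 and triggers a reduction; when no pattern matches, $s_{def}$ approaches 1 and the read-head shifts right.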

Continuous Scratch Pad Memory. The two-to-one reduction of the symbols in the scratch pad shrinks the length of the partially reduced string. To incorporate this fact into the architecture, the scratch pad memory has been implemented as a stack and an input queue. The stack holds the "seen" part of the input string that has been (completely or partially) processed so far; this corresponds to the sections of the string bounded by rectangles at various time steps in Figure 1c. The "unseen" part of the string is contained in the input queue. The top two symbols on the stack are made visible to the network at a particular time step. Reduction of a pair of symbols on the scratch pad corresponds to popping the top two symbols from the stack and pushing the nonterminal symbol associated with the demon unit that recognized the inputs. The left-to-right movement of the read-head is achieved by moving the next symbol from the input queue onto the top of the stack. This operation is performed when no demon units fire.

Since the demon units can be partially active, reading symbols and reduction of symbols using stack operations need to be performed partially. This can be accomplished with a continuous stack of the sort used in Giles et al. (1990). Unlike a discrete stack, where an item is either present or absent in a cell, items in a continuous stack can be present to varying degrees. With each item on the stack we associate a thickness, a scalar in the interval [0, 1] corresponding to what fraction of the element is present. An example of the continuous stack is shown in Figure 3.





Figure 3: A continuous stack. The symbols indicate the contents; the height of a stack entry indicates its thickness, also given by the number to the right. The top composite symbol on the stack is a combination of the items forming a total thickness of 1.0; the next composite is a combination of the items making up the next 1.0 units of thickness.

To understand how the thickness plays a role in processing, we digress briefly and explain the encoding of symbols. Both on the stack and in the network, symbols are represented by numerical vectors that have one component per symbol. The vector representation of some symbol z, denoted by $r_z$, has value 1 for the component corresponding to z and 0 for all other components. If the symbol has thickness t, the vector representation is $t\,r_z$.

Although items on the stack have different thicknesses, the network is presented with composite symbols having thickness 1.0. Composite symbols are formed by combining stack items. For example, in Figure 3, composite symbol 1 is defined as the thickness-weighted sum of the vectors of the items it spans: .2 of the topmost item's vector, .5 of the next item's vector, and .3 of the next. The two symbols that the input units present to the network at each time step consist of the top two composite symbols on the stack.
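A sketch of how such composite symbols could be assembled (ours; the list-of-pairs representation and the helper name are assumptions, not the paper's implementation):

    import numpy as np

    # Sketch (ours): a continuous stack as a list of (vector, thickness) pairs,
    # with the top of the stack last.  A composite symbol is a thickness-weighted
    # sum of item vectors totalling 1.0 units of thickness, as in Figure 3.
    def top_composites(stack, n=2):
        composites = []
        idx, remaining = len(stack) - 1, None
        for _ in range(n):
            comp, need = 0.0, 1.0
            while need > 1e-9 and idx >= 0:
                r, t = stack[idx]
                avail = t if remaining is None else remaining
                take = min(avail, need)      # take only as much thickness as needed
                comp = comp + take * r
                need -= take
                if take < avail:             # this item is only partially consumed
                    remaining = avail - take
                else:                        # move down to the next item
                    remaining, idx = None, idx - 1
            composites.append(comp)
        return composites                    # [composite symbol 1, composite symbol 2]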

The advantages of a continuous stack are twofold. First, it is required for network learning: if a discrete stack were used, a small change in weights could result in a big (discrete) change in the stack. This was the motivation underlying the continuous stack memory used by Giles et al. Second, a continuous stack is differentiable and hence allows us to back propagate error through the stack during learning. Giles et al. did not consider back propagation through the stack.

At every time step, the network performs two operations on the stack:

Pop. If a demon unit fires, the top two composite symbols should be popped from the stack, corresponding to a reduction (to make room for the demon's symbol). If no demon fires, in which case the default unit becomes active, the stack should remain unchanged. These behaviors, as well as interpolated behaviors, are achieved by multiplying by $s_{def}$ the thickness of all items on the stack contributing to the top two composite symbols. Since $s_{def}$ is 0 when one or more demon units are strongly active, the top two symbols get erased from the stack; when $s_{def}$ is 1, the stack remains unchanged.

Push. The symbol written onto the stack is the composite symbol formed by summing the identity vectors of the demon units, weighted by their activities: $\sum_i s_i r_i$, where $r_i$ is the vector representing demon i's identity. Included in the summation is the default unit, where $r_{def}$ is defined to be the composite symbol over the thickness $s_{def}$ of the input queue. Once a composite symbol is read from the input queue, it is removed.
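Putting the pop and push operations together, a minimal sketch of one scratch-pad time step (ours; the (vector, thickness) representation, the simplified pop over just the top two items, and the queue-reading details are assumptions):

    import numpy as np

    # Sketch (ours): one time step of the continuous scratch pad update.
    # stack, queue : lists of (vector, thickness) pairs; top of stack last,
    #                front of queue first
    # s, s_def     : demon unit activities and default unit activity
    # identities   : identity vector r_i of each demon unit
    def scratch_pad_step(stack, queue, s, s_def, identities):
        # Pop: thin by s_def the items under the top two composite symbols
        # (approximated here as the top two stack entries).
        for k in range(1, min(2, len(stack)) + 1):
            r, t = stack[-k]
            stack[-k] = (r, t * s_def)
        # Push: the composite symbol sum_i s_i r_i, plus s_def units of
        # thickness read (and removed) from the front of the input queue.
        pushed = sum(si * ri for si, ri in zip(s, identities))
        need = s_def
        while need > 1e-9 and queue:
            r, t = queue[0]
            take = min(t, need)
            pushed = pushed + take * r
            need -= take
            if take < t:
                queue[0] = (r, t - take)
            else:
                queue.pop(0)
        stack.append((pushed, 1.0))          # the pushed symbol has total thickness 1.0

Because the demon and default activities sum to 1.0, the pushed composite symbol always carries one unit of thickness, so the stack shrinks by one unit when a reduction occurs and grows by one unit when a symbol is shifted in from the queue.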


Training Methodology. The system is trained on positive and negative examples from a context-free grammar. Its task is to learn the underlying rewrite rules and classify each input as grammatical or ungrammatical. Once a positive string is correctly parsed, the symbol remaining on the stack should be the root symbol S (as in Figure 1c). For a negative example, the stack should contain any symbol other than S.

These criteria can be translated into an objective function as follows. If we assume a Gaussian noise distribution over outputs, the probability that the stack contains the symbol S following presentation of example i is

$$ p_i^{S} \propto e^{-\beta \, \| c_i - r_S \|^2}, $$

where $c_i$ is the vector representing the symbol under the read-head; and the probability that the total thickness of the symbols on the stack is 1 (i.e., the stack contains exactly one item) is

$$ p_i^{thick} \propto e^{-\beta \, (T_i - 1)^2}, $$

where $T_i$ is the total thickness of all symbols on the stack and $\beta$ is a constant. For a positive example, the objective function should be greatest when there is a high probability of S being in the stack and a high probability of it being the sole symbol on the stack; for a negative example, the objective function should be greatest when either event has low probability. We thus obtain a likelihood objective function whose logarithm the learning procedure attempts to maximize:

$$ L = \prod_{i \in pos} p_i^{S}\, p_i^{thick} \; \prod_{i \in neg} \left( 1 - p_i^{S}\, p_i^{thick} \right). $$
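A small sketch of this objective (ours; treating the proportionalities as equalities, the variable names, and the Bernoulli-style combination of positive and negative terms are assumptions consistent with the description above):

    import numpy as np

    # Sketch (ours) of the likelihood objective.
    # c    : vector under the read-head (top composite symbol)
    # r_S  : vector representing the root symbol S
    # T    : total thickness of all symbols on the stack
    # beta : constant from the text
    def p_S(c, r_S, beta):
        return np.exp(-beta * np.sum((c - r_S) ** 2))   # stack contains S

    def p_thick(T, beta):
        return np.exp(-beta * (T - 1.0) ** 2)           # stack holds exactly one item

    def log_likelihood(examples, beta):
        # examples: iterable of (c, r_S, T, is_positive) tuples
        ll = 0.0
        for c, r_S, T, positive in examples:
            p = p_S(c, r_S, beta) * p_thick(T, beta)
            ll += np.log(p) if positive else np.log(1.0 - p)
        return ll   # maximized by gradient ascent, back propagating through time and the stack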

The derivative of the objective function is computed with respect to the weight parameters using a form of back propagation through time (Rumelhart, Hinton, & Williams, 1986). This involves "unfolding" the architecture in time and back propagating through the stack. Weights are then updated to perform gradient ascent in the log likelihood function.

Simulation Details. Training sets were generated by hand, with a preference for shorter strings. Positive examples were generated from the grammar; negative examples were randomly generated by perturbing a grammatical string. For a given grammar, the number of negative examples was much larger than the number of positive examples. Therefore, positive examples were repeated in the training set so as to constitute half of the total training examples.
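One way such a training set could be assembled (our sketch; the single-symbol perturbation scheme, function name, and example strings are assumptions for illustration):

    import random

    # Sketch (ours): negatives made by perturbing grammatical strings, with
    # positives repeated so they constitute half of the training set.
    def make_training_set(positives, alphabet, n_negatives, seed=0):
        rng = random.Random(seed)
        positive_set = set(positives)
        negatives = []
        while len(negatives) < n_negatives:
            s = list(rng.choice(positives))
            s[rng.randrange(len(s))] = rng.choice(alphabet)   # perturb one symbol
            s = "".join(s)
            if s not in positive_set:                         # skip strings still in the positive set
                negatives.append(s)
        reps = max(1, len(negatives) // len(positives))       # balance the two classes
        data = [(p, True) for p in positives * reps] + [(n, False) for n in negatives]
        rng.shuffle(data)
        return data

    # Example: short positive strings from the a^n b^n grammar.
    train = make_training_set(["ab", "aabb", "aaabbb"], ["a", "b"], n_negatives=9)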

The total number of demon units and the (fixed) identity of each was specified in advance of learning. For example, for the grammar in Figure 1a, we provided at least two S demon units and one X demon unit. Redundant demon units beyond these numbers did not degrade the network's performance. The initial weights $\{w_{ij}\}$ were selected from a uniform distribution over the interval [.45, .55]. The biases, $b_i$, were initialized to 1.0.




    grammara" b" rewrite rulesS -+ ablaXX -+SbI parenthesis balancing I S -+( ) [ (X ISS

    postfixx -+S)s -+aXlSX

    X -+b+ IS+pseudo-nlp S --+Nv(nV

    Table1: Some results


Figure 4: Sample weights for the grammar a^n b^n. Weights are organized by demon units, whose identities appear above the rectangles. The left and right halves of the three rectangles represent connections from composite symbols 1 and 2, respectively. The darker the shading of a rectangle, the larger the connection strength from the input unit representing that symbol to the demon unit. The weights clearly indicate the three rules in the grammar.

Before an example is presented, the stack is reset to contain only a single symbol, the null symbol, with vector representation 0 and infinite thickness. The example string is placed in the input queue. The network is then allowed to run for 2l - 2 time steps, which is exactly the number of steps required to process any grammatical string of length l. For example, the string aabb in Figure 1c takes 6 steps to be reduced to S.

Results. We have successfully trained the architecture on a variety of grammars. Some of them are listed in Table 1. In each case, the network was able to discriminate positive and negative examples in the training set. For the first three grammars, additional strings were used to test the network's generalization performance. The generalization performance was 100% in each case.

Due to the simplicity of the architecture (the fact that there is only one layer of modifiable weights), the learned weights can often be interpreted as symbolic rewrite rules (Figure 4). It is a remarkable achievement that the numerical optimization framework of neural net learning can be used to discover symbolic rules (see also Mozer & Bachrach, 1991).


Although the current version of the model is designed for LR(0) CFGs, it can be extended to LR(n) grammars by including connections from the first n composite symbols in the input queue to the demon units. However, our focus is not necessarily on building a theoretically powerful formal language recognizer and learning system; rather, our interest has been in integrating symbol manipulation capabilities into a neural network architecture. The model has the ability to represent a string of symbols with a single symbol, and to do so iteratively, allowing for the formation of hierarchical and recursive structures. This is the essence of symbolic information processing and, in our view, a key ingredient for structure learning.

Acknowledgement. This research was supported by NSF Presidential Young Investigator award IRI-9058450 and grant 90-21 from the James S. McDonnell Foundation. We thank Paul Smolensky, C. Lee Giles, and Jürgen Schmidhuber for helpful comments regarding this work.

Bibliography

Cook, C.M. & Rosenfeld, A. (1974). Some experiments in grammatical inference. NATO ASI on Computer Oriented Learning Processes, Bonas, France, pp. 157.

Crespi-Reghizzi, C. (1971). An effective model for grammatical inference. Proceedings of IFIP Congress, Ljubljana, pp. 524.

Das, S., Giles, C.L., & Sun, G.Z. (1992). Learning context-free grammars: Capabilities and limitations of a neural network with an external stack memory. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society (pp. 791). Hillsdale, NJ: Erlbaum.

Fass, L.F. (1983). Learning context-free languages from their structured sentences. SIGACT News, 15, p. 24.

Giles, C.L., Sun, G.Z., Chen, H.H., Lee, Y.C., & Chen, D. (1990). Higher order recurrent networks and grammatical inference. In D.S. Touretzky (Ed.), Advances in Neural Information Processing Systems 2 (pp. 380). San Mateo, CA: Morgan Kaufmann.

Hinton, G.E. (1988). Representing part-whole hierarchies in connectionist networks. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.

Knobe, B. & Knobe, K. (1976). A method for inferring context-free grammars. Information and Control, 31, pp. 129.

Mozer, M.C. & Bachrach, J. (1991). SLUG: A connectionist architecture for inferring the structure of finite-state environments. Machine Learning, 7, pp. 139.


Mozer, M.C. (1992). Induction of multiscale temporal structure. In J.E. Moody, S.J. Hanson, & R.P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4 (pp. 275). San Mateo, CA: Morgan Kaufmann.

Mozer, M.C. & Das, S. (1993). A connectionist symbol manipulator that discovers the structure of context-free languages. In C.L. Giles, S.J. Hanson, & J.D. Cowan (Eds.), Advances in Neural Information Processing Systems 5. To appear.

Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning internal representations by error propagation. In D.E. Rumelhart & J.L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition. Volume I: Foundations (pp. 318). Cambridge, MA: MIT Press/Bradford Books.

Sakakibara, Y. (1988). Learning context-free grammars from structural data. In Proceedings of the 1988 Workshop on Computational Learning Theory (pp. 330). Morgan Kaufmann.

Sun, G.Z., Chen, H.H., Giles, C.L., Lee, Y.C., & Chen, D. (1990). Connectionist pushdown automata that learn context-free grammars. In Proceedings of the International Joint Conference on Neural Networks (pp. I-577). Hillsdale, NJ: Erlbaum Associates.