Evaluation of ICOT's natural language research

Future Generation Computer Systems 9 (1993) 137-142, North-Holland

James Barnett a,* and Kenji Yamada a,b

a MCC, 3500 West Balcones Center Drive, Austin, TX 78759, USA

b Also at: DEC Japan, 134 Goudo-cho, Hodogaya, Yokohama, Japan

Abstract

The Natural Language Processing group at ICOT was small and underwent a considerable turnover in personnel. As a result, the topics of research were quite diverse, though the general areas of interest were similar to those in the West during the same period. The primary emphases were on parsing, the development of grammar formalisms, and the design of an experimental discourse processing system. One feature of interest was the incorporation into their grammar formalisms of the traditional Japanese linguistic theory. Some of the software and the linguistic data are publicly available as a part of the ICOT Free Software and are useful to researchers who are working on Japanese.

Keywords. Natural language processing; parsing; Japanese grammar.

1. Introduction

This paper will give an overview of ICOT's research in natural language processing (NLP). It should be emphasized from the beginning that the core NLP group was quite small relative to other projects at ICOT.¹ Early plans for the ICOT NLP project included the construction of a large machine-readable dictionary [38], but that project was spun off into the EDR (Electronic Dictionary Research Center) lexicon project, which was many times the size of the group that remained with the Fifth Generation Project. ICOT also provided funding to external NLP projects, so the total group of researchers with some connection to ICOT was large, but in-house NLP work was not a major part of the Fifth Generation Project. In general, the natural language work seems to have been viewed as an interface to the core Fifth Generation system rather than as a major research focus in its own right.

Within the NLP group, the research was not focused on any single linguistic topic, in part due to a considerable turnover of personnel. There was a combination of theoretical and more practical work, but the practical work was not necessarily an implementation of the theoretical. There was a general commitment to logic programming (and hence to Definite Clause Grammars [23]), but only Matsumoto's work on parallel logic parsing, Mukai's work on CIL, and cu-Prolog (see below) focused on issues that are specific to doing natural language analysis in a logic programming framework.

* Corresponding author.

¹ At its largest, the NLP group consisted of fewer than 20 researchers and developers, and it was often much smaller than that.

0376-5075/93/$06.00 © 1993 - Elsevier Science Publishers B.V. All rights reserved

2. The techniques chosen by ICOT

We can divide the research at ICOT into two parts, depending on whether the work was focused on the particularities of Japanese or not. We will discuss the non-language-specific techniques first.

Algorithmic substrate. The main developments in the algorithmic substrate of the ICOT NLP work were the SAX parser, the Complex Intermediate Language (CIL), and cu-Prolog. SAX was a pseudo-parallel Prolog implementation of the PAX algorithm, which was a parallelization of an earlier Prolog bottom-up chart parsing algorithm [14]. The parsing algorithm was extended to a range of linguistically interesting grammars [18,16], and is therefore of considerably more practical benefit than algorithms that are restricted to context-free grammars. CIL [20] is an extension of Prolog with a freeze operator and parameterized types (complex indeterminates). It was designed to support reasoning with incomplete information, which is a central concept in Situation Semantics. More recently there has been work on constraint-logic programming approaches to natural language processing [21,8], culminating in the development of cu-Prolog [32] to support the parsing of constraint-based grammars. cu-Prolog incorporates user-defined predicates as constraints and has been used to implement a parser for Japanese Phrase Structure Grammar (see below).
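As a rough illustration of the bottom-up chart parsing idea mentioned above, the following Python sketch fills a chart over a toy context-free grammar in Chomsky normal form (a CYK-style recognizer). It is not the SAX or PAX algorithm, it is sequential rather than parallel, and the grammar, lexicon, and function names are hypothetical choices made only for this example.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (not one of ICOT's grammars).
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("Det", "N"): {"NP"},
}
LEXICON = {
    "taro": {"NP"},
    "found": {"V"},
    "the": {"Det"},
    "book": {"N"},
}


def chart_parse(words):
    """Return a chart where chart[(i, j)] holds the categories spanning words[i:j]."""
    n = len(words)
    chart = defaultdict(set)
    # Seed the chart with lexical categories (spans of length 1).
    for i, word in enumerate(words):
        chart[(i, i + 1)] |= LEXICON.get(word, set())
    # Combine adjacent smaller spans into larger ones, shortest spans first.
    for length in range(2, n + 1):
        for start in range(0, n - length + 1):
            end = start + length
            for mid in range(start + 1, end):
                for left in chart[(start, mid)]:
                    for right in chart[(mid, end)]:
                        chart[(start, end)] |= BINARY_RULES.get((left, right), set())
    return chart


def recognizes(words, goal="S"):
    """True if the whole input can be analysed as the goal category."""
    return goal in chart_parse(words)[(0, len(words))]
```

Calling `recognizes(["taro", "found", "the", "book"])` checks whether the toy grammar covers the whole input. Because every chart cell depends only on shorter spans, cells of the same length could in principle be filled independently, which is one reason chart parsing lends itself to parallel formulations.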

Semantic framework. The primary semantic framework was Situation Semantics [3]. On the whole, semantics received considerably less attention than syntax did, and the semantic work focused on developing a computational implementation of the theory [22,20], rather than on specifically linguistic issues involving the nuances of the semantics of Japanese.

Complete systems. The DUALS system was the only complete NLP system built as part of the Fifth Generation Project [36,30]. It was a prototype system whose goal was to read and answer questions about grade-school and junior-high school textbooks. There were three versions of the DUALS system, the last two of which used the SAX parser, and CIL/Situation Semantics for semantic representation. The syntactic framework changed, however, from Lexical-Functional Grammar [9] in the first version to a form of Relational Grammar which incorporated elements of Kokugo-gaku (see below). The architecture of the system provided for the use of both linguistic and world knowledge to support text understanding, but in practice almost all the knowledge was linguistic. Processing was divided into roughly four stages: syntactic/semantic analysis, object identification (e.g. dealing with ellipsis and null pronouns), discourse integration (building a model of the discourse), and finally question answering/response generation. The final version of the system (DUALS-III) had a vocabulary of about 2000 words and handled a corpus of approximately 200 sentences.

The structure of Japanese. Before discussing the aspects of ICOT's research that were specific to Japanese, we will present a brief overview of the ways Japanese is different from western European languages. First of all, Japanese is written in multiple alphabets (Kanji, Hiragana, and Katakana) and without spaces between the words. Kanji (Chinese characters) are used mainly for verb stems and nouns, while Hiragana and Katakana are used for inflectional morphology and foreign words, respectively. However, these are only rough distributional tendencies, so that simply identifying word boundaries is a nontrivial problem. Syntactically, the basic constituent is called a bunsetsu and generally consists of a noun or verb with following particles which carry modal information and serve as case markers. The order of the bunsetsu within a sentence is relatively free, except that the bunsetsu for the main verb always goes at the end. Furthermore, nominal bunsetsu can be freely omitted. As an example of this, consider the Japanese equivalent of 'Taro found Hanako.' We will use uppercase characters to indicate parts that would be written in Kanji, and lowercase to indicate Hiragana. We can say either TAROwaHANAKOwoMItsuketa or HANAKOwoTAROwaMItsuketa. There are three bunsetsu in this sentence: TAROwa, HANAKOwo, and MItsuketa ('found'). Here -wa is a topic marker which is used to indicate subjective case marking, -wo indicates objective case marking, and -ta indicates past tense. Note that either order of the topic/subject and object bunsetsu is allowed. Furthermore, either can be omitted (roughly in the same situations where we would use a pronoun in English), giving HANAKOwoMItsuketa or TAROwaMItsuketa. The last sentence is ambiguous, since the topic marker wa can be interpreted as indicating either the subject or the object (or just the general topic of the discussion). Thus TAROwaMItsuketa can be taken to mean either 'Taro found (something)' or '(someone) found Taro.'
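The ambiguity just described can be made concrete with a small Python sketch that enumerates the possible role assignments for the nominal bunsetsu of a sentence. The particle-to-role table is a deliberate simplification assumed for this example (in particular, the pure 'general topic' reading of wa is ignored), and the names `PARTICLE_ROLES` and `readings` are hypothetical.

```python
from itertools import product

# Simplified, hypothetical mapping from particles to the case roles they may mark.
# The pure "general topic" reading of wa is left out to keep the sketch small.
PARTICLE_ROLES = {
    "wa": ("subject", "object"),
    "wo": ("object",),
}


def readings(nominal_bunsetsu):
    """Enumerate role assignments for a list of (noun, particle) pairs.

    A role that stays None stands for an omitted argument.
    """
    options = [PARTICLE_ROLES[particle] for _, particle in nominal_bunsetsu]
    results = []
    for roles in product(*options):
        if len(set(roles)) < len(roles):
            continue  # two bunsetsu cannot fill the same role
        reading = {"subject": None, "object": None}
        for (noun, _), role in zip(nominal_bunsetsu, roles):
            reading[role] = noun
        results.append(reading)
    return results
```

With both bunsetsu present, `readings([("TARO", "wa"), ("HANAKO", "wo")])` leaves a single assignment in either order, because wo claims the object role. With only `("TARO", "wa")`, two assignments remain, matching the two translations given above.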

Parsing Japanese thus entails a variety of tasks that don't come up with Western languages: first the input must be segmented into words, then nominal bunsetsu must be attached to the appropriate verbs, with allowances made for varying orders and missing arguments, and finally the deep case relations ('semantic' subject and object, etc.) must be determined.
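To give a feel for the first of these tasks, the sketch below enumerates every way of splitting an unsegmented input into entries of a lexicon. It is a minimal illustration in Python, not a description of any ICOT system: the toy lexicon (including the entries HANA and KO, added here only to create an ambiguity) and the function name `segmentations` are assumptions made for this example.

```python
# Hypothetical toy lexicon in the romanized notation used above.
LEXICON = {"TARO", "HANAKO", "HANA", "KO", "wa", "wo", "MItsuke", "ta"}


def segmentations(text, lexicon=LEXICON):
    """Return every way to split text into a sequence of lexicon entries."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Keep this prefix and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results
```

For the input `"TAROwaHANAKOwoMItsuketa"` this toy lexicon admits two segmentations, one treating HANAKO as a single word and one splitting it into HANA and KO. Even in this tiny case a lexicon alone does not determine the word boundaries, so further constraints are needed to choose among the candidates. The plain recursion is exponential in the worst case; a practical version would memoize on the remaining suffix.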

Syntactic frameworks. ICOT worked in a variety of syntactic frameworks, each with a different approach to these problems. In the early years of the project there was a working group devoted to Japanese Phrase Structure Grammar (JPSG), which was an attempt to extend a popular Western syntactic framework, namely Generalized Phrase Structure Grammar [6], to Japanese [19]. JPSG basically inherits GPSG's principle- and feature-based formalism but extends it to handle specific Japanese phenomena. To handle the free word order, for example, the SUBCAT feature principle was extended to use a set operation. Only a small fragment of JPSG was actually implemented, and work in this framework at ICOT seems to have tapered off in recent years.

There was also work on context-free grammars and gapping grammars in Definite Clause Grammars (DCG) form as part of the development of the SAX system [18,15]. Since all these formalisms assume words as input, the LAX morphological system was developed to segment and tag the input [28]. LAX used a lexicon to identify possible words and morphological connection tables to constrain the assignment of categories.

In the later years of the project, two further grammatical frameworks were developed: Localized Unification Grammar (LUG, [25,26]) and Restricted Dependency Grammar (RDG, [5]). LUG was a phrase-structure grammar in DCG form, but unlike standard Western phrase-structure grammars (or SAX and LAX), morphological analysis and segmentation of the input were done as part of the parsing process. This amalgamation of syntactic and morphological analysis reflected a recent development in Kokugo-gaku, the traditional Japanese linguistic framework, which was developed over many centuries completely independent of Western linguistics. RDG was quite different from LUG, but also incorporated ideas from Kokugo-gaku. In particular, RDG relies on a looser concept of dependency between bunsetsu instead of the more rigid notions of phrase structure, which require adjacency of constituents. Each bunsetsu is classified by dependency type, and constraints on dependency are utilized to restrict the space of possible dependency relations. Sample grammars and parsers for both LUG and RDG were implemented and are available as part of the ICOT Free Software.

External research. In addition to its internal research, ICOT provided partial funding to a variety of external NLP projects. Among these projects were ones on learning grammars from parse trees (Fujitsu, [24]), the use of prosodic information in speech recognition (Hitachi, [10]), the development of coherence rules for Japanese discourse (Toshiba, [33]), as well as work on an NL-Based Information Retrieval system (the IRIS system at Fujitsu, [31]), a Text Summarization system (the COGITO system at Oki, [11]), a sentence retrieval system using grammatical patterns (the KWIC system from Sharp), and two question and answer systems (the IDS system at Mitsubishi and the ToR system at Matsushita, [12,37]).

Software releases. Much of ICOT's NLP software has been made available to outside researchers. In 1988, the various subsystems of the DUALS system were grouped together and released as the Language Tool Box [29,1]. More recently, another release of NLP software, including LUG, RDG, a 150,000 word morphological lexicon, CIL and cu-Prolog, has been made available as part of the ICOT Free Software distribution.

3. Comparison with other work

The most striking difference between the ICOT NLP work and other work in Japan was the lack of emphasis on machine translation at ICOT. This de-emphasis was the result of a conscious decision at MITI, which put its funding for machine translation and large-scale system-building elsewhere: EDR (Japan Electronic Dictionary Research Institute), ATR (Interpreting Telephone Research Laboratories), CICC (The Center for International Cooperation for Computerization) and other organizations are receiving substantial governmental funding for machine translation research and large-scale system development. The small scale and relatively theoretical focus of the ICOT project was the result of organizational factors rather than a lack of government interest in commercially utilizable NLP.

On the whole, the ICOT NLP work was similar to work in the West during the same period. In particular, the primary emphasis on grammars and parsing, with semantics and pragmatics as a secondary topic, is quite similar to the overall distribution of effort in the West during the 80s. A recent interest in constraint grammars is another area of commonality between ICOT researchers and their colleagues in the West (e.g. [35,13]). In terms of overall architecture and functionality, the DUALS system is comparable to systems developed in the West over the same period (e.g. [7,4,2]). Natural Language Generation (as opposed to understanding) was treated as a poor relative in both ICOT and the West during this period, though it is one field in which substantial progress has been made recently, particularly in the area of reversible generation algorithms (e.g. [27,34]).

The obvious difference between the Fifth Generation NLP work and work in the West is ICOT's emphasis on Japanese. Here the most interesting twists are in the projects that rely on the traditional grammatical framework of Kokugo-gaku. Western syntactic frameworks have been developed primarily for work on European languages (overwhelmingly English, in fact). Such frameworks certainly have enough formal power to handle Japanese, but they don't do so very naturally. The phenomena of English that have attracted the most attention (e.g. wh-movement) are peripheral in Japanese, while the relatively free word order, frequency of ellipsis and the loose topic/comment structure of Japanese don't fit into a phrase structure framework very comfortably. In the face of such problems, the incorporation of the indigenous grammatical tradition is a promising approach.

4. Evaluation

The ICOT NLP effort was small and not focused on any single task. It therefore doesn't make sense to talk of an "ICOT approach"; there were roughly as many approaches to as many different problems as there were researchers. It would also be inappropriate to talk of success or failure, since there was no clear overall goal established for the NLP effort. The NLP work did follow the general ICOT plan, in that the work fell into roughly three stages: Initial (82-84), Intermediate (85-88) and Final (89-92). The first stage was characterized by theoretical work and some preliminary efforts with DCGs. The second stage was the most productive, with work on SAX, CIL, DUALS, and the Language Tool Box. The NLP group seems to have gotten smaller in the third stage, when the main results were Localized Unification Grammar, Restricted Dependency Grammar, the work on constraint grammars, and some small prototype discourse processing and generation systems. There was some continuity in theme and scaling up of effort between the first and second stages, with the DUALS system serving as the focal point for the various technologies. The third stage, however, seems to have been quite disconnected from the earlier ones, and lacking in any internal focus (work on the DUALS system stopped altogether).

Given these discontinuities in theme and scale, it is probably better to think of the NLP effort as a kind of post-doc research program, with a high turnover of researchers coming and going to and from universities and industry. From this perspective, ICOT's main role was to train researchers and disseminate technology. The work of the individual researchers, however, must be evaluated separately, and the resulting judgements will be necessarily subjective and dependent on the interests of the judges. From our personal point of view, we would emphasize the SAX parser and the incorporation of Kokugo-gaku as highlights of the work at ICOT. More objectively, we note that the free NLP software that ICOT has provided is a central part of a movement in Japan to make NLP resources available (the Information Processing Society of Japan has recently founded a study group on this topic). The available tools, including the LUG and RDG grammars, the SAX parser, and the JUMAN morphological analyzer (a successor of LAX, developed at the University of Kyoto, [17]), do not form a complete, off-the-shelf NLP system, but are quite useful to researchers who are working on Japanese.

References

[1] K. Akasaka, Y. Kudo, F. Fukumoto, H. Fukushima and K. Hagiwara, Language Tool Box (LTB): A program library of NLP tools, Technical Report TR-521, ICOT, 1989.

[2] J. Barnett, I. Mani, K. Knight and E. Rich, Knowledge and natural language processing, Commun. ACM (Aug. 1990).

[3] J. Barwise and J. Perry, Situations and Attitudes (MIT Press, Cambridge, 1983).

[4] D. Dahl, M. Palmer and R. Passoneau, Nominalizations in PUNDIT, in: Proc. ACL (1987).

[5] F. Fukumoto, H. Sano, Y. Saitoh and J. Fukumoto, A framework for restricted dependency grammar based on the word's modifiability level-restricted dependency grammar, Trans. Informat. Processing Soc. Japan 33(10) (1992) (in Japanese).

[6] G. Gazdar, E. Klein, G. Pullum and I. Sag, Generalized Phrase Structure Grammar (Basil Blackwell, Oxford, 1985).

[7] B. Grosz, D. Appelt, P. Martin and F. Pereira, TEAM: An experiment in the design of transportable natural-language interfaces, Artificial Intelligence 32(2) (1987).

[8] K. Hasida, Common heuristics for parsing, generation and whatever ..., in: Proc. Workshop on Reversible Grammars in Natural Language Processing (1991).

[9] R. Kaplan and J. Bresnan, Lexical-functional grammar: A formal system for grammatical representation, in: J. Bresnan, ed., The Mental Representation of Grammatical Relations (MIT Press, Cambridge, 1982).

[10] A. Komatsu, E. Oohira and A. Ichikawa, Prosodical sentence structure inference and word spotting for conversational speech understanding, Technical Memorandum, TM-391, ICOT, 1987.

[11] E. Komatsu, Y. Kato, H. Yasuhara and T. Shiino, Summarization support system COGITO, Technical Memorandum, TM-415, ICOT, 1987 (in Japanese).

[12] S. Kondo and M. Imamura, IDS: Cooperative responses based on dialogue model, Technical Memorandum, TM-469, ICOT, 1988 (in Japanese).

[13] P. Marrafa and P. Saint-Dizier, Reversibility in a constraint and type based logic grammar: Application to secondary predication, in: Proc. ACL Workshop on Reversible Grammars (1991).

[14] Y. Matsumoto, A parallel parsing system for natural language analysis, Technical Report TR-146, ICOT, 1985.

[15] Y. Matsumoto, Parsing gapping grammars in parallel, Technical Report TR-318, ICOT, 1987.

[16] Y. Matsumoto, Natural language parsing systems based on logic programming, PhD thesis, Univ. of Kyoto, 1989.

[17] Y. Matsumoto, S. Kurohashi, T. Utsuro, Y. Myoki and M. Nagao, User's Guide for the JUMAN Japanese Morphological Analysis System, Version 1.0, Kyoto University, Nagao Lab. (Jan. 1993) (in Japanese).

[18] Y. Matsumoto and R. Sugimura, A parsing system based on logic programming, Technical Report TR-252, ICOT, 1987.

[19] H. Miyoshi, T. Gunji, H. Sirai, K. Hashida and Y. Harada, A phrase structure grammar for Japanese - JPSG, Comput. Software 3(4) (1986) (in Japanese).

[20] K. Mukai, Horn clause logic with parameterized types for situation semantics programming, Technical Report TR-101, ICOT, 1985.

[21] K. Mukai, Discourse understanding and logic, Technical Memorandum, TM-567, ICOT, 1988.

[22] K. Mukai, H. Yasukawa, H. Miyoshi and H. Hirakawa, A computational model for situation semantics and its realization, Technical Memorandum, TM-051, ICOT, 1984.

[23] F.C.N. Pereira and D.H.D. Warren, Definite clause grammars for language analysis - A survey of the formalism and a comparison with augmented transition network, Artificial Intelligence 13 (1980).

[24] Y. Sakakibara, An efficient learning of context-free grammars from positive structural examples, Technical Report TR-488, ICOT, 1989.

[25] H. Sano and F. Fukumoto, Localized unification grammar and its representation, in: Proc. 41st Annual Conf. Information Processing Soc. Japan (1990) (in Japanese).

[26] H. Sano, F. Fukumoto and H. Onodera, An integrated framework for grammar formalism in a unification-based approach, Technical Memorandum, TM-993, ICOT, 1990.

[27] S. Shieber, G. van Noord, R. Moore and F. Pereira, Semantic head-driven generation, Computat. Linguistics 16(1) (1990).

[28] R. Sugimura, K. Akasaka, Y. Kubo, Y. Matsumoto and H. Sano, Logic based lexical analyzer LAX, Technical Report TR-362, ICOT, 1988 (in Japanese).

[29] R. Sugimura, K. Akasaka, T. Okunishi, Y. Kubo, K. Hatano, Y. Tanaka and T. Takizuka, Configuration of the Language Tool Box, in: Proc. 37th Annual Conf. Information Processing Soc. Japan (1988) (in Japanese).

[30] R. Sugimura, K. Akasaka, Y. Tanaka, K. Hasida, K. Mukai and Y. Kubo, Natural language processing in the experimental discourse understanding system DUALS-III, Technical Report TR-494, ICOT, 1989.

[31] K. Sugiyama et al., IRIS: An intelligent information retrieval system based on natural language understanding, Technical Report TR-210, ICOT, 1986.

[32] H. Tsuda, cu-Prolog for Constraint-Based Grammar, in: Proc. Internat. Conf. Fifth Generation Computer Systems 1992 (1992).

[33] T. Ukita, K. Ono and S. Amano, Computational analysis of linguistic discourse structure for Japanese text, Technical Memorandum, TM-657, ICOT, 1989.

[34] G. van Noord, An overview of head-driven bottom-up generation, in: R. Dale, C. Mellish and C. Zock, eds., Current Research in Natural Language Generation (Academic Press, New York, 1990).

[35] G. van Noord, Reversibility in Natural Language Processing, PhD thesis, University of Utrecht, 1992.

[36] H. Yasukawa, H. Hirakawa, K. Mukai, H. Miyoshi and Y. Tanaka, Outline of the discourse understanding system DUALS, Technical Memorandum, TM-118, ICOT, 1985.

[37] H. Yasukawa and H. Suzuki, USSR: Situation oriented semantic representation language, Technical Memorandum, TM-508, ICOT, 1988 (in Japanese).

[38] T. Yokoi, K. Mukai, H. Miyoshi and Y. Tanaka, Research activities in natural language processing of the FGCS Project, Technical Report TR-190, ICOT, 1986.

James Barnett is Project Manager of the Knowledge Based Natural Language Project at MCC. He recently received his PhD in Linguistics from the University of Texas at Austin.

After receiving a B.A. from Foreign Language Department of Sophia University, Tokyo, in 1985, Kenji Yamada joined Digital Equipment Corporation Research and Development Center in Japan. He has been assigned to Microelectronics and Technology Corporation since 1992.