Natural language processing with LOLITA


Richard Morgan and Roberto Garigliano

The ability of computers to process information and communicate with users has been severely limited by their inability to analyse or generate free-format written or spoken language. Whereas the techniques needed to deal with highly constrained ‘formal’ languages are well understood, the techniques for dealing with written text or the spoken word have yet to mature. The processing of such information is commonly referred to as Natural Language Processing (NLP). The LOLITA* system is a prototype NLP system that can analyse and generate written English. The potential for such a system is enormous, and is demonstrated by some sample applications which are built into the current prototype.

The availability of NLP would have a dramatic impact on both the uses to which computer systems can be put and the ease with which they can be used [1]. In terms of the ‘uses’, the vast majority of information transfer and storage that takes place in a modern society is in terms of written or spoken natural language. Although current computer systems can handle written information in a crude way, they are limited to operations on the superficial structure of the information, made up of entities such as the characters, words, lines, etc. However, the importance of the written information lies in its meaning, and hitherto any operations on the meaning must be carried out by a human. For example, one operation that might be required is to search for and summarize information about a person’s whereabouts. It is difficult to see how such an operation can be specified in terms of the superficial word structure; however, a possible solution is to search for occurrences of the required name that appear textually close to a location word, and output any such words. This solution goes badly wrong in sentences such as ‘John was not in London’.
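The failure mode described above is easy to reproduce. The following sketch (hypothetical code, not part of LOLITA) implements the naive proximity search and shows it reporting ‘London’ for a sentence that states John was not there:

```python
# Naive keyword-proximity search for a person's whereabouts, as described
# in the text.  This is an illustrative sketch, not any real system's code.

def whereabouts(text: str, name: str, locations: set[str], window: int = 5) -> list[str]:
    """Report any location word appearing within `window` words of `name`."""
    words = text.replace(".", "").split()
    hits = []
    for i, w in enumerate(words):
        if w == name:
            nearby = words[max(0, i - window): i + window + 1]
            hits.extend(loc for loc in nearby if loc in locations)
    return hits

# The superficial search cannot see the negation:
print(whereabouts("John was not in London.", "John", {"London", "Paris"}))
# reports ['London'] even though the sentence says the opposite
```

Any approach that operates only on word positions, rather than meaning, is open to errors of exactly this kind.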

In terms of the ‘usability’, much of our use of computer systems is hampered by the need to communicate with the computer in a way that is suited to the computer rather than to the human. Access to computing technology has already been greatly improved by the use of modern graphical user interfaces; however, there is still a huge gap between this and the type of interface which could be made available through NLP techniques. For example, even with the use of a modern graphical user interface, the steps required to produce a graph of company turnover for the last five years from a company database are highly involved, especially when compared to the corresponding natural language instruction. When interfacing to the user with natural language it is essential that the computer can deal with the natural language at the level of its meaning rather than its superficial structure.

Richard Morgan, BSc, PhD
Received his BSc in Computer Science and Electronics from the University of Keele in 1985 and his PhD from Durham University in 1991. He is currently a lecturer in Computer Science at Durham University.

Roberto Garigliano, MSc, DPhil
Received his Laurea in Filosofia from the University of Genoa in 1976, his MSc in Logic from the University of London in 1984 and his DPhil from Oxford University in 1986. He is currently a lecturer in Artificial Intelligence at Durham University.

*LOLITA is an acronym for: Large-scale, Object-based, Linguistic Interactor, Translator and Analyser.

The field of NLP is currently at a stage where it is possible to start developing products from the research prototypes. In this paper, we concentrate on one such prototype, the LOLITA system, and look at the products that could be developed from this prototype.

The LOLITA system

The LOLITA system is an NLP system that has been developed at the University of Durham over the last seven years. It concentrates on the processing of written rather than spoken natural language, and provides a core set of general purpose NLP facilities which can be used to build specific natural language applications. It also contains a number of prototype applications which make use of these core facilities. Since it is difficult to understand the core facilities without some idea of what they are used for, we present the prototype applications first and then describe the core facilities. These prototype applications include a query system and a content scanning system.

The query system

There is a vast potential for applications in which natural language is used as a means of communication between the application and its users, thus increasing the accessibility of the application to nonspecialist users.

In such applications, the system must be capable of maintaining a dialogue with the user, responding to input such as questions, statements and commands. It must also be capable of maintaining a context in the dialogue, thus allowing references to previous parts of the dialogue. LOLITA’s query system demonstrates the possibility of such applications by providing a statement/question dialogue which allows the user to add information to LOLITA’s memory by making statements and also allows the user to access LOLITA’s knowledge about the world by asking questions.

Example query session

The following is an example dialogue with the query system. Initially, the user makes a statement and LOLITA responds with ‘I understand’ to signify that the statement makes sense according to LOLITA’s analysis, and has been recorded in memory. Note that in the query system, no attempt is made to assess the validity of the information on the basis of its source.

User: I own a big fast motorbike.
LOLITA: I understand.

It is now possible to check that this information has been recorded by asking questions about it. We can ask questions that are very close to the original statement (although even in this first example, the system must be able to deal with the fact that the big fast motorbike is a specialization of what we are asking for).

User: Do I own a motorbike?
LOLITA: Yes, you own a big fast motorbike.

In the next question, the link between the original statement and the question is much less obvious. To answer it, the system must recognize ‘to own’ as a specialization of ‘to control’, and ‘vehicle’ as a generalization of ‘motorbike’.

User: Do I control a vehicle?
LOLITA: Yes, you own a big fast motorbike.
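The specialization and generalization lookups behind this answer can be sketched as a walk up a small is-a hierarchy. The hierarchy and fact store below are illustrative inventions, not LOLITA’s actual semantic network:

```python
# Sketch: answering a query by generalizing both the verb and the noun.
# ISA records that owning is a way of controlling, and that a motorbike
# is a kind of vehicle.  Purely illustrative data.

ISA = {
    "own": "control",        # to own is a specialization of to control
    "motorbike": "vehicle",
    "vehicle": "thing",
}

def is_a(concept: str, general: str) -> bool:
    """True if `concept` equals or specializes `general` in the hierarchy."""
    while concept is not None:
        if concept == general:
            return True
        concept = ISA.get(concept)
    return False

# From the statement 'I own a big fast motorbike':
facts = [("user", "own", "motorbike")]

def query(subject: str, action: str, obj: str) -> bool:
    return any(s == subject and is_a(a, action) and is_a(o, obj)
               for s, a, o in facts)

print(query("user", "control", "vehicle"))   # True: own -> control, motorbike -> vehicle
```

The stored fact never changes; only the matching is hierarchy-aware, which is why the system answers with the specific fact (‘you own a big fast motorbike’) rather than echoing the question’s terms.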

As well as producing yes/no style answers, the system can also provide answers to questions such as ‘what’, ‘which’ and ‘how many’.

User: How many vehicles do I own?
LOLITA: At least 1.

Notice that here the system does not make what is known as the ‘closed world’ assumption. In other words, the system does not assume that it knows everything, and so can only say that the user owns at least one vehicle.
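Open-world counting of this kind can be sketched in a few lines; the fact store here is invented for illustration:

```python
# Sketch of an open-world answer to 'How many vehicles do I own?'.
# Under the closed-world assumption the count of known facts would be
# treated as exact; without it, known facts give only a lower bound.

known_facts = [("user", "own", "motorbike")]

def how_many_owned(subject: str) -> str:
    n = sum(1 for s, verb, _ in known_facts if s == subject and verb == "own")
    # The system does not assume its knowledge is complete, so it
    # answers with a lower bound rather than 'Exactly {n}'.
    return f"At least {n}"

print(how_many_owned("user"))   # At least 1
```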

As well as recording any facts asserted by the user, the system keeps track of the context of the interaction. This allows the system to respond correctly to continuations of previous utterances, as is shown by the following example:

User: I like you.
LOLITA: I understand.
User: And Keith too.
LOLITA: I understand.
User: Whom do I like?
LOLITA: Keith and me.

The query system can also be used to extract information from pieces of text previously analysed by the system. The following is a small example of the system’s ability to answer questions about a piece of text.

User: I was tired, so I called a taxi and went home. The cab was warm and the driver was friendly so I gave him a big tip.
LOLITA: I understand.
User: Why did I go home?
LOLITA: You were tired.

Although this example demonstrates only the basic aspects of the query system, it gives some idea of the potential for building applications which provide a natural language interface. Database applications are an obvious candidate for this technology, especially when users need rapid and flexible access to the data, but are not familiar with the appropriate query languages.

Template: Incident
Incident:
Where:
When:
Responsible:
Target:
Damage:
  Human:
  Thing:
Source:
Source Date:
Certainty:

Figure 1 General form of incident template.
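The incident template of Figure 1 can be sketched as a simple data structure; the field names below are taken from the figure, while the class itself is a hypothetical illustration:

```python
# Sketch of the Figure 1 incident template as a data structure.
# Field names follow the figure; the class is illustrative only.
from dataclasses import dataclass

@dataclass
class IncidentTemplate:
    incident: str = ""
    where: str = ""
    when: str = ""
    responsible: str = ""
    target: str = ""
    damage_human: str = ""
    damage_thing: str = ""
    source: str = ""
    source_date: str = ""
    certainty: str = ""

    def render(self) -> str:
        """Lay the template out one slot per line, as in the figures."""
        rows = [("Template", "Incident"), ("Incident", self.incident),
                ("Where", self.where), ("When", self.when),
                ("Responsible", self.responsible), ("Target", self.target),
                ("Damage Human", self.damage_human),
                ("Damage Thing", self.damage_thing), ("Source", self.source),
                ("Source Date", self.source_date), ("Certainty", self.certainty)]
        return "\n".join(f"{k}: {v}" for k, v in rows)

print(IncidentTemplate(incident="A bomb explosion").render().splitlines()[1])
# Incident: A bomb explosion
```

A slot left empty simply renders blank, matching the behaviour described next: the system fills in only what the article provided.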

The content scanner

Content scanning involves searching through a number of reports in order to find information of relevance. Until recently there were only two methods for performing this search: manual and automated keyword search. Manual search is labour intensive, costly and time-consuming. Keyword search on text can drastically reduce the cost and increase the speed, but cannot cope with phenomena such as negative or hypothetical information, and as a result this method often produces unnecessary and inaccurate reports.

The content scanner in the LOLITA system aims to provide the accuracy of a manual search with the low cost of the keyword search. Due to the complexity of the task, it is inevitably slower than systems based on keyword search.

The LOLITA content scanner has been tested on newspaper articles about terrorist activities. This choice was made so that the system could conform to the standard accepted by the Defense Advanced Research Projects Agency (DARPA) sponsored conferences on message understanding (MUC-1 to 5). The output of the content scanner is presented as a series of templates, containing information such as ‘where’, ‘when’ and ‘who’. There are currently three forms of template: one for terrorist incidents, one for investigations and one to cover general terrorist-related information.

Each of these template forms has an overall structure, with gaps left for information that the system should fill in (provided the information was present in the article being scanned). The system generates one template for each separate incident, investigation or general piece of information.

For example, Figure 1 shows the general form of an incident template. The first line simply indicates that the template is for an incident. The second describes what type of incident was found, and the remaining lines give further details of the incident. The source line gives information about the source of the information (which newspaper); the source date gives information about when the text was published; and the certainty line allows the system to give some indication of how reliable the information is.

Figure 2 is an example of a newspaper article that has been processed by the system. It contains a number of interesting features. Firstly, there are three separate references to the incident itself (an explosion). The first one appears as ‘A car bomb exploded’; the second appears as ‘the explosion’; and the final one as ‘the bomb went off’. Another feature of the article is that the information which needs to be gathered for the template is distributed throughout the text. For example: the first sentence gives time and location information; the second gives damage, time and location information; the final paragraph gives damage information. Finally, there is no explicit information about the target, although it seems likely that the targets were the Cabinet Office and 10 Downing Street.

A car bomb exploded outside the Cabinet Office in Whitehall last night, 100 yards from 10 Downing Street. Nobody was injured in the explosion which happened just after 9pm on the corner of Downing Street and Whitehall. Police evacuated the area. First reports suggested that the bomb went off in a black taxi after the driver had been forced to drive to Whitehall. The taxi was later reported to be burning fiercely.

Figure 2 Article from The Telegraph, 31 October 1992. © The Telegraph plc, London, 1992.

Figure 3 shows the template produced by the LOLITA content scanner when given the newspaper article from Figure 2. The three references to the incident have been unified to give a single template containing nearly all of the relevant details. The only information missing is that it is very likely that the IRA were responsible. Each piece of information is given by generating the appropriate English text. Although this English is rather verbose, it is a simple matter to tell the system to produce a more concise template, for example by reporting the incident as ‘A bomb explosion’ rather than ‘The bomb explosion outside the Cabinet Office and outside 10 Downing Street in a black taxi’.

Template: Incident
Incident: The bomb explosion outside the Cabinet Office and outside 10 Downing Street in a black taxi
Where: Outside the Cabinet Office and outside 10 Downing Street in the black taxi that a driver drove to Whitehall
When: last night (30 October), when a forceful person forced a driver to drive to Whitehall
Responsible:
Target: Cabinet Office
Damage:
  Human: Nobody
  Thing: a black taxi that fiercely burned; the black taxi that a driver drove to Whitehall
Source: Telegraph
Source Date: 31 October 1992
Certainty: facts

Figure 3 Example of a template produced from the article in Figure 2.
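The unification of the three incident references can be sketched as field-wise merging of partial templates. This grossly simplifies the real task (genuine coreference resolution works on the semantic network, not on dictionaries), but it shows the shape of the operation:

```python
# Sketch: each mention of the incident contributes some template slots;
# unification merges the partial fills into one template.  The mention
# data below is hand-written for illustration.

def unify(partials: list[dict]) -> dict:
    """Merge partial templates slot by slot, keeping the first value found."""
    merged: dict = {}
    for partial in partials:
        for slot, value in partial.items():
            merged.setdefault(slot, value)
    return merged

mentions = [
    {"incident": "a car bomb exploded", "where": "outside the Cabinet Office",
     "when": "last night"},
    {"incident": "the explosion", "damage_human": "nobody injured",
     "when": "just after 9pm"},
    {"incident": "the bomb went off", "where": "in a black taxi"},
]
template = unify(mentions)
print(template["damage_human"])   # nobody injured
```

Note that the hard part, deciding that all three mentions refer to the same explosion in the first place, is assumed here; that decision is exactly what the LOLITA core’s meaning-level analysis provides.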

The LOLITA content scanner has a number of limitations, some of which are dictated by the limitations of the LOLITA core; others are limitations specific to the content-scanning task. These limitations concern the type of input that will be accepted; the degree to which the user can specify what information they require the scanner to extract from its input text; and the way in which the scanner will present the results of its scanning. Since LOLITA’s content scanner relies on the core of the system to process incoming text, the content scanner will deal properly only with information that is acceptable to the core. This currently rules out text containing nonliteral meaning such as metaphor and humour.

On the output side, the form of the templates and the type of information that will be placed in them is currently fixed, so that they cannot be changed by users of the system. However, it is relatively straightforward for the implementors of the system to set up alternative templates and selection criteria.

To produce general-purpose content scanning applications it will be necessary to extend the current system to allow users to specify templates and selection criteria themselves. This could initially be done through a conventional user interface which would allow the user to select and combine appropriate options; however, a more satisfactory approach would be to make use of the natural language facilities to allow the users to enter a dialogue in which they describe their criteria to the system, and the system asks for clarification and examples.

[Block diagram showing LOLITA’s components, including morphological analysis, grammatical analysis and generation.]

Figure 4 Block diagram of LOLITA’s components.

Depending on the intended area of the application, it may be preferable to develop content scanners that produce their output in the form of natural language summaries, or alternatively allow the user to ask questions about the information scanned by the system using an interface similar to the query system.

The LOLITA core

The LOLITA core is based on the idea that to process natural language, we need to convert the word-by-word representation of text into a representation that is more suitable for computer analysis and processing. This representation is known as a semantic network, since it aims to reflect the meaning of text rather than its grammatical or morphological structure. The main tasks of the LOLITA core are therefore to convert incoming text into its semantic network representation (analysis), to allow operations on the semantic net (inferences) and to construct text from pieces of semantic net (generation). The main components of the LOLITA system are shown in Figure 4.

The semantic network consists of a large number (in excess of 30,000) of nodes, each representing an entity or event. Each node is connected to related nodes by links. For example, the central node in Figure 5 is the semantic net representation for ‘Roberto owns a motorbike’. It is an event node which is connected to the event’s action (own), subject (Roberto), and object (a motorbike). The advantage of this representation is that information about the event can be found simply by following the appropriate links.
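Following links from an event node can be sketched as dictionary lookups; the node and link names below are illustrative, not LOLITA’s internal representation:

```python
# Sketch of the Figure 5 event node: 'Roberto owns a motorbike' becomes an
# event connected by links to its action, subject and object, so each role
# is recovered by following one link.  Names are illustrative only.

network = {
    "event1": {"action": "own", "subject": "roberto", "object": "motorbike1"},
    "motorbike1": {"instance_of": "motorbike"},
}

def role(node: str, link: str) -> str:
    """Follow a single link from a node."""
    return network[node][link]

print(role("event1", "subject"))   # roberto
```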


[Diagram: a central event node linked to its action (‘own’), subject (‘Roberto’) and object (‘a motorbike’).]

Figure 5 Semantic network representation of ‘Roberto owns a motorbike’.

The analysis section of the LOLITA core is implemented as a number of interlinked phases. Morphological analysis extracts the root of a word and features such as its tense (for example ‘ran’ has root ‘run’ and the feature ‘past tense’); grammatical analysis searches for the structure of a sentence, indicating which phrases are contained within which; semantic analysis converts the structured representation produced by the parser into a semantic net representation, according to the definitions of the words involved. Finally, the pragmatic analysis uses LOLITA’s knowledge about the world to reject interpretations that do not fit with this knowledge.
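The first of these phases can be sketched in isolation. The tiny lexicon and fallback rule below are invented for illustration and bear no relation to LOLITA’s actual morphology:

```python
# Sketch of morphological analysis: a surface form is mapped to a root
# plus grammatical features, e.g. 'ran' -> root 'run', feature 'past'.
# The lexicon and the crude regular-inflection fallback are illustrative.

LEXICON = {
    "ran":  ("run", {"tense": "past"}),
    "runs": ("run", {"tense": "present", "person": "3sg"}),
    "owns": ("own", {"tense": "present", "person": "3sg"}),
}

def morph(word: str) -> tuple:
    if word in LEXICON:                       # irregular / known forms
        return LEXICON[word]
    if word.endswith("s"):                    # crude regular-inflection rule
        return word[:-1], {"tense": "present", "person": "3sg"}
    return word, {}                           # assume the word is a root

print(morph("ran"))   # ('run', {'tense': 'past'})
```

Each later phase consumes the previous phase’s output: the parser works on roots and features, semantic analysis on the parse, and pragmatic analysis on candidate meanings.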

The inference component of the core provides a facility to search the semantic network for events of a specified form. For example, in the query system the question ‘do I own a motorbike?’ generates a particular semantic network structure that is then handed to the inference component. This searches the semantic network, discovering a corresponding structure. The structure discovered is handed back to the query system, which then uses the generator to produce the appropriate response. The main inference technique used in the LOLITA system is multiple inheritance, which allows for specialization and generalization, although most other main types of normal inference methods (for example, implication) are also carried out. More recently, plausible reasoning inferences have been implemented. These allow the system to draw conclusions that are merely plausible rather than certain. For example, a theory for handling analogies has already been implemented [2]. To prevent disastrous errors from occurring, information which is derived from plausible inferences is marked with a low certainty, and such inferences can be disallowed if undesirable in a particular application.
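The certainty marking described above can be sketched as a filter over tagged facts; the facts and tags below are invented for illustration:

```python
# Sketch: facts derived by plausible reasoning (e.g. analogy) carry a low
# certainty tag, and an application can exclude them from its answers.
# The fact store and tag names are illustrative only.

facts = [
    {"fact": ("user", "own", "motorbike"), "certainty": "stated"},
    {"fact": ("user", "like", "speed"),    "certainty": "plausible"},  # analogy-derived
]

def answers(allow_plausible: bool) -> list:
    allowed = {"stated"} | ({"plausible"} if allow_plausible else set())
    return [f["fact"] for f in facts if f["certainty"] in allowed]

print(len(answers(allow_plausible=False)))   # 1 - only stated facts survive
print(len(answers(allow_plausible=True)))    # 2 - plausible conclusions included
```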

The generator’s task is to render a portion of the semantic net in natural language. It attempts to do this within the style constraints set by the system (at the moment, the style constraints are limited to rhythm, sentence length and colourful/plain language). It also attempts to ensure that sufficient information is generated for the reader unambiguously to determine the objects and events it is talking about.

Limitations

Although the LOLITA system demonstrates the potential for building applications using natural language technology, it has a number of limitations which would have to be overcome when producing products from the system. These limitations require substantial investment to overcome, since they generally require a large amount of development (rather than research) work.

- The LOLITA system is not robust. The system has been very successful in parsing complex real-life text examples but can sometimes fail on a relatively simple construct, just because the construct has not been tried before. These problems can usually be solved quickly by adding a few simple rules or heuristics, but it means the system may fail unexpectedly. It is hard to know the coverage of the system or the likelihood of these problems, as the system has not been tested on any large corpora. This scale of evaluation has not yet been carried out, because there are serious difficulties involved in making sure that such evaluations are not misleading [3].
- The system currently does not run fast enough to process large volumes of written information.
- Some important inference techniques, such as temporal inference, are not yet covered.
- The semantic network, generally, is not heavily populated with information for natural language understanding. If, however, the system is used extensively in one domain (for example, financial articles) then this area of the network quickly becomes more populated. This lack of uniformity of population density causes problems when transferring between different domains: when the system is used in a new domain, the network will be densely populated in the area of the previous domain, with information that will no longer be required. A more hierarchical arrangement of the semantic network may be required, which would allow portions of domain-specific network to be ‘slotted’ into the general semantic network without disturbing it.

Other applications of NLP

Although the LOLITA system provides an indication of possible applications for NLP technology, there are many other possible applications for the technology. Some of them could be based on the current prototype, but others still need to be facilitated by further research.

One of the most obvious and potentially lucrative applications is that of machine translation [4]. Commercial systems already exist which provide translation facilities; however, these are generally based on a surface translation of morphology and grammar. They do not take into account the meaning of the text being translated, and their output must be post-edited to ensure a reasonable translation. Nonetheless, these systems are important because they deal with the more tedious aspects of translation and reduce the amount of human effort required. The LOLITA core provides the possibility of translation which preserves meaning, the general approach being to analyse the source language to produce the semantic net representation and then to generate the target language from the semantic net representation. To do this, the LOLITA core needs to be extended to provide analysis and generation for languages other than English. Some analysis of Italian has been implemented, and thus the system has a limited ability to translate from Italian to English. This approach to translation is good for applications in which the meaning of the text is the main consideration and the output can be generated in a fixed style (for example translation of business letters or technical documentation), but will not deal with situations in which the text must be translated retaining the style of the original. Currently the LOLITA generation of English can be produced with a variety of styles, but there is no analysis of style available. Finally, an even more advanced application of machine translation is to automate the simultaneous translation of speech, thus enabling real-time conversations between speakers of different languages. Such applications are not possible on the basis of current research results, and are likely to be at least a decade away.
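The analyse-then-generate approach can be sketched with a trivially faked ‘semantic’ middle layer; the two lookup tables below stand in for LOLITA’s analysis and generation components and are purely illustrative:

```python
# Sketch of meaning-preserving translation: analyse the source sentence to
# a language-neutral representation, then generate the target language
# from that representation.  Both tables are hand-written toy stand-ins.

IT_TO_MEANING = {
    "roberto possiede una motocicletta": ("roberto", "own", "motorbike"),
}
MEANING_TO_EN = {
    ("roberto", "own", "motorbike"): "Roberto owns a motorbike",
}

def translate_it_to_en(sentence: str) -> str:
    meaning = IT_TO_MEANING[sentence.lower()]   # 'analysis' step
    return MEANING_TO_EN[meaning]               # 'generation' step

print(translate_it_to_en("Roberto possiede una motocicletta"))
# Roberto owns a motorbike
```

The point of the architecture is that adding a new language requires only new analysis and generation components; the shared semantic representation stays the same.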

Another application area is that of word-processing related tools, such as spelling and grammar checkers. Currently such tools are based on superficial analyses that take no account of meaning. As a result a spelling checker may fail to detect errors in which one word has been incorrectly spelt as another. Other tools might allow for the automated manipulation of text: for example, to change the tense or person. Such tools could be based on facilities already provided by the LOLITA core; however, it must be recognized that, although better results can be expected than any achievable through a superficial analysis, they will have a higher computational cost.
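The real-word error problem is easy to demonstrate: a dictionary-based checker flags nothing in a sentence where every word is valid but one is wrong (‘their’ for ‘there’). A minimal sketch, with an invented toy dictionary:

```python
# Sketch of a superficial spelling checker: it only asks whether each word
# is in the dictionary, so a real-word error ('their' for 'there') passes.

DICTIONARY = {"i", "went", "their", "there", "to", "the", "shop"}

def misspellings(sentence: str) -> list[str]:
    return [w for w in sentence.lower().split() if w not in DICTIONARY]

print(misspellings("I went their to the shop"))   # [] - the error passes unseen
```

Catching this class of error requires exactly the meaning-level analysis the LOLITA core provides, at a correspondingly higher computational cost.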

Other NLP systems

Although we have concentrated on a particular natural language system, there are a number of other systems in existence. These may be classified depending on whether they are actual products or prototypes like the LOLITA system, and also according to their generality.

Specialist systems are systems dedicated to a particular task or a particular form of language. An example is the Topic content scanning system, which is a product produced by Verity and used by the American Department of Defense. Other examples are METAL, a translation system produced and sold by Siemens Nixdorf; SYSTRAN, a system for translating between Russian and English; and METEO, a system used by the Canadian government for translating weather reports.

Generic systems are built on the concept of central NLP tasks such as the natural language/meaning conversion discussed above. The assumption behind such systems is that the difficult parts of NLP can be provided by a generic system which can then be specialized (with some additional effort) to produce particular applications. The LOLITA system falls into this category, since it provides a set of core natural language facilities which can then be put together in a way that is suitable for particular applications. Two other generic systems are the CLARE system [5], which is being built at Cambridge University, and the Cyc system [6], being built by the Microelectronics and Computer Technology Corporation (MCC) in America. None of these systems has resulted in products as yet.

Finally, a natural language system that can be used for a wide range of NLP tasks without further customization would be a general purpose system. However, no such system exists at present.

Conclusions

NLP techniques provide the potential for a new revolution in information technology. They will allow computers to process the vast wealth of written information currently available, and change the way in which we communicate with computers. The LOLITA project has already produced a large, working system which effectively parses, analyses and generates natural language in an interactive environment. These capabilities have been used as the basis for some prototype applications, which demonstrate possible uses for natural language technologies. The potential benefit of these applications is enormous; however, substantial investment is needed to move from prototype systems such as LOLITA to commercial products.

References

[1] Butler, C.S. Computers and Written Texts. Basil Blackwell, 1992.
[2] Long, D. and Garigliano, R. A Formal Model for Reasoning by Analogy. Ellis Horwood, 1993.
[3] Galliers, J.R. and Sparck Jones, K. Evaluating Natural Language Processing Systems. Technical Report 291, Computer Laboratory, University of Cambridge, March 1993.
[4] Newton, J. (ed.). Current research in machine translation. In Computers in Translation: A Practical Appraisal. Routledge, 1992.
[5] Carter, D. and Alshawi, H. Corpus processing with a preferential rule-based system. Summary of a presentation to the workshop on integrating speech and natural language, University College Dublin, 1992.
[6] Lenat, D.B., Guha, R.V., Pittman, K., Pratt, D. and Shepherd, M. Cyc: toward programs with common sense. Communications of the ACM 33(8), 1990.
