COMPUTATIONAL LINGUISTICS-PROCEDURES AND PROBLEMS.

REPORT RESUMES ED 012 923 AL 000 639 COMPUTATIONAL LINGUISTICS-PROCEDURES AND PROBLEMS. BY- LEHMANN, W.P. TEXAS UNIV., AUSTIN, LINGUISTICS RES. CTR. REPORT NUMBER LRC-65-WA-1 PUB DATE JAN 65 EDRS PRICE MF-$0.25 HC-$1.48 37P. DESCRIPTORS- *COMPUTATIONAL LINGUISTICS, LANGUAGE, LINGUISTIC THEORY, *MACHINE TRANSLATION, MATHEMATICAL LINGUISTICS, LINGUISTIC PATTERNS, STRUCTURAL ANALYSIS, *DATA PROCESSING, COMPUTERS, CLASSIFICATION. BASED ON A LECTURE GIVEN AT THE UNIV. OF TEXAS SCIENCE CONFERENCE, NOV. 20, 1964, THIS PAPER PRESENTS IN RELATIVELY NON-TECHNICAL TERMINOLOGY A DESCRIPTION OF THE "STRUCTURAL" APPROACH TO THE STUDY OF LANGUAGE WHICH UNDERLIES THE WORK OF THE LINGUISTICS RESEARCH CENTER. THIS APPROACH ANALYZES LANGUAGE IN SUCH A WAY THAT IT CAN BE MANIPULATED WITH A COMPUTER. STRESSING THE NECESSITY FOR A MORE COMPLETE UNDERSTANDING OF LANGUAGE AS THE BASIS FOR MACHINE TRANSLATION AND COMPUTATIONAL LINGUISTICS, THE AUTHOR DEALS WITH (1) THE FORMAL STRUCTURE OF LANGUAGE, (2) SIMULATION, (3) LANGUAGE DATA PROCESSING, (4) AUTOMATIC CLASSIFICATION, (5) ANALYSIS OF MEANING, AND (6) ACCOMPLISHMENTS IN THE FIELD OF LINGUISTIC RESEARCH. INCLUDED ARE REPRODUCTIONS OF THE ANALYSIS OF A SENTENCE WITH A PARSING DIAGRAM, AND A CHART OF THE LINGUISTICS RESEARCH SYSTEM. (AM)



THIS IS A WORKING PAPER. IT MAY BE EXPANDED, MODIFIED,

OR WITHDRAWN AT ANY TIME. THE VIEWS, CONCLUSIONS,

AND RECOMMENDATIONS EXPRESSED HEREIN DO NOT

NECESSARILY REFLECT THE OFFICIAL VIEWS OF THE SPONSOR.

LINGUISTICS RESEARCH CENTER

THE UNIVERSITY OF TEXAS

BOX 7247 UNIVERSITY STATION AUSTIN 12, TEXAS


U.S. DEPARTMENT OF HEALTH, EDUCATION & WELFARE

OFFICE OF EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM THE

PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR OPINIONS

STATED DO NOT NECESSARILY REPRESENT OFFICIAL OFFICE OF EDUCATION

POSITION OR POLICY.

COMPUTATIONAL LINGUISTICS:

PROCEDURES AND PROBLEMS

"PERMISSION TO REPRODUCE THIS

COPYRIGHTED MATERIAL HAS BEEN GRANTED

BY [signature illegible]

TO ERIC AND ORGANIZATIONS OPERATING

UNDER AGREEMENTS WITH THE U.S. OFFICE OF

EDUCATION. FURTHER REPRODUCTION OUTSIDE

THE ERIC SYSTEM REQUIRES PERMISSION OF

THE COPYRIGHT OWNER."

W. P. Lehmann

prepared for

National Science Foundation

Grant NSF GN-308

LINGUISTICS RESEARCH CENTER

The University of Texas

Box 7247, University Station

Austin, Texas 78712

LRC 65 WA-1 January 1965


CONTENTS

Abstract

Foreword

1 Introduction

2 Formal Structure

3 Simulation

4 Language Data Processing

5 Automatic Classification

6 Analysis of Meaning

7 Accomplishments

Appendix


ABSTRACT

The necessity for a more complete understand-

ing of language as the basis for machine translation

and computational linguistics is stressed. Other

benefits which will result from this longterm re-

search -- including information retrieval and auto-

matic classification -- are also mentioned.


FOREWORD

This paper is based on a lecture given at

The University of Texas Science Conference, November 20,

1964. The conference provided a means for scientists

in The University of Texas faculties to get acquainted

with one another and to listen to brief expositions

of research in progress on the campus. Part of the

research mentioned here was performed under grant

NSF GN-708.


1 INTRODUCTION

This paper deals with a relatively new approach

to the study of language, one underlying the work of

the Linguistics Research Center. This view regards

language in such a way that it can be manipulated

with a computer. Yet the view cannot be related to

technological developments, for it preceded the computer.

The resultant approach to language has often been

called structural linguistics.

Language may be studied from many points of

view. One may wish to acquire a graceful mastery of

one or more languages, either for writing special

kinds of texts called poems, or simply to impress as

well as inform an audience. One may wish to learn

about the history of specific languages, how English

is related to Hindi, Greek, Armenian or Irish. The

most prominent interest in language in Western culture

arose from a desire to understand venerated texts,

primarily texts in Hebrew, Greek, Latin, the Bible

and the classics. The understanding of these texts

led to the development of special techniques and

attitudes about language. For us, oddly enough in

contrast with the Greeks and Romans, the written

language has seemed more fundamental than the spoken,

and we have spent more time learning to read than to

speak languages, whether French, German, Russian, or

the cited classical languages. Further, since in our

day written materials are broken up into units called

words, these seem to us the fundamental entities of

language.


Moreover, since these languages, especially

Latin, seem somehow to be model languages, we have

sought a mastery of current languages, including our

own, through descriptions--grammars--which are modeled

on the grammar of Latin. To master a language,

including English, our grammarians, teachers and students

note its resemblances to Latin and also the differences

from it. This procedure may be compared with that of a

geographer who adopts one location, for example, New

York City, as the ideal and describes all other locations

by their resemblances to it. From such a geographer

we would not get a map of London, Paris or Moscow, but

rather various maps of New York modified in accordance

with deviations from New York in these cities.

It is not my aim to present a critique of

any view of language, or of our methods of teaching

languages, or even of any type of research on language.

But since we have all studied languages in accordance

with the Latin-based approach, we regard any language

in accordance with the views given us in our schools.

These views must therefore be specified if we are to

understand one another. I might also mention that the

first attempts at computer processing of language

failed because the scholars concerned viewed the es-

sential problem as the manipulation of words.

Besides discussing a somewhat different

approach to language I will touch on the linguistic

investigations it has prompted and is continuing to

require. I will also deal briefly with the requirements

this approach is making on computer programming

and may make on computer technology.


2 FORMAL STRUCTURE

Possibly the most important feature of struc-

tural linguistics is the understanding that language

has a formal structure, composed of various sub-struc-

tures. In these structures the function or value of

any entity is determined largely by its relationship

with other entities. Entities then are not defined

by their relationship to the outside world; a noun

for example is not defined as the name of a person,

place or thing. Defining entities by the outside world seems

to a linguist somewhat similar to presenting mathematics

through concrete objects, never to add 2 and 2,

but always two apples to two apples, and so on. There

is little doubt that our understanding of numbers,

our progress in mathematics, would have been hampered

if we had dealt with them only in connection with the

outside world rather than as abstract signs. Lin-

guists hold that such a view of language has impeded

our understanding of it.

Some decades ago a few scholars began to

examine language as a system of signs whose function

was specified by their interrelationships. This

approach to language--this theory, if you wish--

would define a noun in any given language by its

relationship to other entities in that language. A

noun in English, for example, might be defined as an

entity with certain relationships to inflectional

elements, to a Z-like entity in the plural: arm :

arms, or in the possessive: man : man's. Other

languages might not have nominal inflection and accordingly

would not have a class of nouns. In Japanese,

for example, only verbs are inflected; we cannot


then speak of a class of inflected nouns. Another

basis of definition of entities might be by their

relationship to independent entities, for example,

articles: the, a, or to verbs. By such a definition

man is a noun because it can follow the; went on the

other hand is not. Using such a procedure, we can

identify nouns in Japanese. Further, we can also

identify larger acceptable entities, such as sentences,

by their entities and the interrelationships of these;

men talk, for example, is an acceptable English sentence,

but not men happily, or even men happy.
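The definition by relationship described above can be sketched in a few lines of Python. This is only an illustration: the list of attested sequences and the function names are invented here, and a real description would rest on far more data.

```python
# Hypothetical toy data: a handful of attested two-word sequences.
ATTESTED = {
    ("the", "man"), ("the", "arm"), ("a", "man"),
    ("men", "talk"), ("men", "walk"),
}
ARTICLES = {"the", "a"}

def follows_article(word, attested=ATTESTED):
    """True if the word is attested directly after an article."""
    return any((article, word) in attested for article in ARTICLES)

def is_acceptable(sequence, attested=ATTESTED):
    """A two-word sequence is accepted only if it is attested."""
    return tuple(sequence) in attested

print(follows_article("man"))             # True: man can follow the
print(follows_article("went"))            # False: went cannot
print(is_acceptable(["men", "talk"]))     # True
print(is_acceptable(["men", "happily"]))  # False
```

The outside world plays no part in these checks; an entity is classed solely by the company it keeps.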

When this approach to language was pursued, the

work of linguistics came to be looked on as the deter-

mination of the entities of any language and their

interrelationships.

Two requirements are necessary before one

can deal usefully with language in this way. We must

first determine whether the materials are genuine,

whether for example an English speaker permits the

sequence: men talk. Next, we ascertain whether the

entities in this sequence have a characteristic meaning.

In such determination we elicit comparable sequences,

e.g. men walk; then talk, and so on. With such con-

trasting sequences we would satisfy ourselves that m

is a characteristic marker, for it is the only entity

distinguishing men talk from then talk, or distin-

guishing man talks (a statement an anthropologist

might make) from Ann talks (a statement one might make

about a very young lady). Similarly, t is a characteristic

sound marker, distinguishing ten from men,

talk from walk, and so on.
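The comparison of contrasting sequences may be illustrated with a short Python sketch. The segmentations below are rough assumptions made for the example ("th" and "aw" each stand for a single sound), and the function name is hypothetical.

```python
from itertools import combinations

# Assumed rough segmentations; "th" and "aw" each stand for one sound.
WORDS = {
    "men": ("m", "e", "n"),
    "then": ("th", "e", "n"),
    "ten": ("t", "e", "n"),
    "talk": ("t", "aw", "k"),
    "walk": ("w", "aw", "k"),
}

def minimal_pairs(words):
    """Return (word1, word2, sound1, sound2) for every pair of words
    that differ in exactly one segment."""
    pairs = []
    for (w1, s1), (w2, s2) in combinations(sorted(words.items()), 2):
        if len(s1) != len(s2):
            continue
        diffs = [(a, b) for a, b in zip(s1, s2) if a != b]
        if len(diffs) == 1:
            pairs.append((w1, w2, diffs[0][0], diffs[0][1]))
    return pairs

for pair in minimal_pairs(WORDS):
    print(pair)
```

Each pair printed isolates one distinguishing sound, such as m against t in men and ten.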


In addition to determining the characteristic

entities of sound in a language, we may also determine

larger entities, for example, talk as opposed to walk.

These differ from entities like the first consonants

of men, ten and then in that they have established

relationships to certain concepts. Briefly, we say

they have meaning. The first consonants of men, ten

and then do not. They serve to distinguish meanings,

but we cannot associate with them any given concepts,

such as 'animateness', 'number' or 'temporality'.

Rigorous techniques for determining entities

of both kinds have been developed.

When such entities are specified in a given

language, linguists set out to determine their role--

one might say, their properties. In English there

are about forty entities like m and t. These may be

regarded as signs, comparable to other kinds of signs

man uses, e.g. 3, 4. Just as a mathematician might

investigate relationships between such units in a

given number system, setting up various classes, e.g.

primes, so a linguist might investigate the role of

such entities in a given language. He might determine

what relationships t has with regard to the other

entities of sound. In English, for example, t may

precede e if n follows, but not e alone; there is no

English sequence te. Nor are there sequences like

tne, etn, and so on. Other such problems will occur

to any linguist, mathematician, or to anyone who en-

joys manipulating signs. Yet few such problems have

been investigated, even in a widely used language like

English, to say nothing of 5,000 other languages. We

have had neither the personnel nor the physical resources.
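A problem of this kind is easily put to a machine. The following Python sketch, which works with spellings rather than sounds and uses a small invented word list, sorts the possible orders of three signs into those that occur and those that do not.

```python
from itertools import permutations

# Hypothetical sample lexicon; a real investigation would use a full word list.
LEXICON = {"ten", "net", "men", "talk", "walk"}

def attested_orders(signs, lexicon=LEXICON):
    """Split all orderings of the given signs into occurring and missing ones."""
    candidates = {"".join(p) for p in permutations(signs)}
    return sorted(candidates & lexicon), sorted(candidates - lexicon)

occurring, missing = attested_orders("ten")
print(occurring)  # ['net', 'ten']
print(missing)    # ['ent', 'etn', 'nte', 'tne']
```

With only three signs there are six orders to inspect; the number grows rapidly as signs are added, which is one reason such inventories have seldom been compiled by hand.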


A similar range of problems might be cited

in the investigation of entities like walk, talk, take,

brake, and so on. We might find sequences in which man :

men precede any of these, e.g. the man walked, the man

braked around curves, etc. But we do not find sequences

like: walkman, talkman, takeman paralleling brakeman.

A complete description of any language would specify

which of such sequences occur.

At this stage of his investigations a linguist

does not deal with meaning. He has determined that

brake differs from take, that it has a meaning; but in

examining possible sequences like brakeman he deals

only with its properties of occurrence. Nonetheless

this second type of investigation, noting the interrelationships

between entities like take, brake,

man, provides even more problems than does the

first.

Still other entities must be identified in

language and investigated similarly. But the two types

of entities I have selected may exemplify the approach

of structural linguistics.

Those structural linguists who concern them-

selves exclusively with the study of sets of linguistic

entities and their interrelationships are sometimes

called mathematical or computational linguists. Other

linguists may deal with other language problems--the

pronunciation of talk, walk in various areas, the

stylistic differences between talk and speak, and

so on. But a computational linguist limits his concern

to sets of entities and their interrelationships.


If his approach to language is valid, in

using language men acquire a number of entities and

learn how to manipulate them in relation to one

another. Further, if a machine could be devised

which would store the number of entities stored by

man, with rules specifying their relationships to

other entities, the machine might simulate man's mastery

of language.


3 SIMULATION

As is well-known, about twenty years ago a

machine was developed which seemed to have the essen-

tial capabilities, the computer. Possibly the manip-

ulation of language would never have engaged the at-

tention of computer specialists if the problem of

rapid intercommunication had not become so prominent.

To be sure computation centers might have found it

amusing in time to have a few language games available

for visitors, when they became bored with tic-tac-toe,

checkers, chess or go. But since the scientists, who

were nursing along the infant computer and contemplating

uses for it when it matured, had just been involved in

international struggles which pointed up the importance

of reading the scientific publications of the other

side, they suggested that the computer might solve the

problem of intercommunication. The computer therefore

was looked on as the machine to take over the unin-

spiring activity of translation; supporting agencies

provided time on computers and a small amount of money

to research workers, whose goal was to be machine translation.

This seemingly overriding goal was the prime

activity for which language specialists might use

computers. To the outside world today still, linguists

doing research with computers are working on machine

translation.

With a million words a day of important

materials awaiting translation from Russian to English,

let alone materials of secondary importance or materials

in Chinese, Japanese, German, French and so on, machine

translation would be a fine accomplishment. But the


problem requires a bit of preliminary work. We may view

the essential requirement as one of synthesizing sentences.

This activity may be compared to synthesizing protein

molecules--though nothing like the expenditure of time

and money has been applied to linguistic investigation

as to that of chemistry.

One of the first problems we may note is that

language is not a simple linear structure; rather, it

consists of numerous structures. One is made up of

entities like t, m, and so on, which might be compared

with atoms; this structure contains relatively few

entities, but their rules of interrelationship are com-

plex. A second structure is made up of entities like

then, men, walk, which might be compared with radicals;

this structure contains a great number of entities,

possibly with somewhat less complex rules of inter-

relationships. From these the smallest free form of

language is constructed, the sentence. In making

sentences, in using language man has somehow learned to

master both of these structures. More, he has learned

the relationships of the entities in the second struc-

ture to a totally different structure, that of concepts.

Since computer manipulation of language is a type of

simulation, before we can use computers effectively for

managing language, we must understand how these various

structures relate to one another, how language functions.


4 LANGUAGE DATA PROCESSING

Of the various problems, some are straightforward,

for example, the amassing of entities and their

rules. We spend the first ten years of our lives acquiring

control over one language, continue to add to

our stock of entities, and rarely achieve mastery over

a second language. To give the computer similar

opportunities we must have large-scale programs, by

means of which we can store genuine materials and

materials with a characteristic meaning. A great deal

of effort has been expended by members of the Linguistics

Research Center over the past five years to develop the

system of programs which handle the data of language and

their interrelationships.

Man has taken care of this problem very clev-

erly. He reduces language to a set of entities of

sound, about forty in a language, and accordingly has

relatively few building blocks to control. Unfortunately

no machine has been devised to match man's discriminatory

powers in managing the entities of speech.

Accordingly at present, machine manipulation of language

must be based on the second level with its tremendous

number of entities.

Since our work in the Center is still experi-

mental, it is difficult to forecast how many such

entities must be stored in a computer. Some estimates

put the number of chemical terms in German at two

million; the rules for relating them to other entities

of the language will obviously be fewer.

In the relatively small computers available today

the rules indicating interrelationships and vocabulary


items of a language must be stored on magnetic tape.

The programming system developed at the Linguistics

Research Center has been successful in analyzing

materials of limited vocabulary and syntactic complex-

ity, consisting of about 50,000 rules and items in

each language.

Until one has dealt with the highly rigorous

computer it is almost impossible to visualize the prob-

lems involved in a thorough analysis of language. A

simple example from a physics textbook may illustrate

some of them:

Loudness is the property of sound determined

by the effect of the power

of the sound waves on our ears.

Let us suppose that we write a computer routine--a

syntactic rule--relating of to a following noun, as in

of sound; the rule will not then handle the sequence of

the power, for here of is followed by the definite

article. If we modify our rule accordingly, we still

have not handled the use of of in of the sound waves,

for here the article is followed by a noun used as

adjective. In putting a sentence like this into a

computer, we must therefore provide for sequences

of of and a variety of entities. Obviously our rules

cannot be simple, though our example may have been.
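The successive modifications of the rule for of may be sketched in Python as follows. The tag table, the pattern list and the function name are assumptions made for the illustration; they are not the rules of the Center's system.

```python
# Hypothetical tag lexicon for some words of the example sentence.
TAGS = {"of": "PREP", "the": "ART", "sound": "NOUN",
        "power": "NOUN", "waves": "NOUN"}

# Tag sequences allowed after "of"; longer patterns are tried first.
PATTERNS = [("ART", "NOUN", "NOUN"), ("ART", "NOUN"), ("NOUN",)]

def of_phrase(words, start):
    """Return the longest phrase beginning with 'of' at index start, or None."""
    if words[start] != "of":
        return None
    for pattern in PATTERNS:
        chunk = words[start + 1 : start + 1 + len(pattern)]
        if tuple(TAGS.get(w) for w in chunk) == pattern:
            return ["of"] + chunk
    return None

sentence = ("loudness is the property of sound determined by the effect "
            "of the power of the sound waves on our ears").split()
for i, word in enumerate(sentence):
    if word == "of":
        print(of_phrase(sentence, i))
```

With only the pattern for a single noun, the second and third occurrences of of would go unanalyzed; each further pattern in the list corresponds to one of the modifications discussed above.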

Another entity of the sentence, on, may

illustrate a further type of problem. If we relate

on to the surrounding entities, we arrive at the

possible sequence: the sound waves on our ears. This

sequence might compare to that of the flag waves on the


flag-pole or the policeman waves on the traffic. But

such relationships which would appear identical to a

computer trouble us; the sentence from our physics

textbook seems absurd to us if on is related to waves

in either of these ways. We have learned that on is

related to effect and that the meaningful sequence is

effect on our ears. We scarcely need to discuss the in-

adequate translations that would be produced if such

sentences were put word-for-word into German, Russian

or other languages.
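One way of giving a machine this knowledge is a table noting which nouns expect which prepositions. The Python sketch below is a minimal illustration under that assumption; the table EXPECTS and the function are hypothetical and stand in for far more detailed linguistic rules.

```python
# Hypothetical lexical note: nouns that expect a particular preposition.
EXPECTS = {"effect": "on"}

def attach_preposition(words, prep_index, nouns, expects=EXPECTS):
    """Choose the noun to which the preposition at prep_index relates.

    Prefer the nearest preceding noun that expects this preposition;
    otherwise fall back to the nearest preceding noun.
    """
    prep = words[prep_index]
    preceding = [i for i in range(prep_index - 1, -1, -1) if words[i] in nouns]
    for i in preceding:
        if expects.get(words[i]) == prep:
            return words[i]
    return words[preceding[0]] if preceding else None

words = "the effect of the power of the sound waves on our ears".split()
nouns = {"effect", "power", "sound", "waves", "ears"}
print(attach_preposition(words, words.index("on"), nouns))              # effect
print(attach_preposition(words, words.index("on"), nouns, expects={}))  # waves
```

Without the lexical note the procedure relates on to the adjacent waves, which is the absurd reading; with it, the sequence effect on our ears is recovered.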

Since an English speaker understands his

language in this way, a computer must also be prepared

to manipulate it accordingly. To arrange such manip-

ulations we must describe English far more precisely

than has ever been done before. The required detail of

description has never been provided before because

native speakers master such sequences, and we are

charitable to foreigners who learn inadequate English

from our inadequate grammars. But if a computer makes

any requirement, it is for precision. A computer would

not be happy with our simple sentence until it knows

what to do with every entity, including on. Conse-

quently a linguist has to determine the role of an

entity first of all, then describe it. Since even the

large dictionaries which have been produced for Eng-

lish, German and the other widely studied languages

have not described these languages adequately, linguists

in the Linguistics Research Center are now at

work producing such descriptions--writing rules for

English, German, Russian and other languages. Figures

1 - 5 illustrate the procedures involved in making


a syntactic analysis of an English sentence, in

accordance with a grammar written by Dr. Wayne Tosh

of our Center.

The resultant rules are many and intricate.

When produced, they must be handled by the computer,

but kept independent of it through use of generalized

computer programming. If, for example, a specialized

program were written for handling combinations of

prepositions plus nouns or prepositions plus articles

plus nouns, it would have to be revised to handle

sequences of prepositions plus articles plus noun-

adjectives plus nouns, as in of the sound waves. The

Linguistic Research System, produced under the direction

of Eugene Pendergraft, was devised to meet this require-

ment of generalization. With this system linguistic

rules, independent of specialized computer programs,

may be produced to handle phrases of various length--

preposition plus noun as in of sound, preposition plus

article plus noun, as in of the power, and longer phrases

like of the sound waves. Other instructions alert the

computer to watch out for prepositions like on after

a noun like effect. A chart of the system indicates

the demands placed on computer manipulation of language

and also one of the results of five years of work,

supported by the US Army Electronics Laboratories and

by the National Science Foundation.
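The separation of linguistic rules from the program that applies them may be illustrated by the following Python sketch. It is not the Linguistic Research System; the rule table, class names and functions are invented for the example. The point is only that the program stays fixed while the rules, held as data, are extended.

```python
# Hypothetical rule table: class name -> alternative sequences of classes or words.
RULES = {
    "PP": [["PREP", "NP"]],
    "NP": [["ART", "NOUN", "NOUN"], ["ART", "NOUN"], ["NOUN"]],
    "PREP": [["of"], ["on"]],
    "ART": [["the"]],
    "NOUN": [["sound"], ["power"], ["waves"]],
}

def match(symbol, words, pos, rules=RULES):
    """Return every end position at which symbol can finish when started at pos."""
    if symbol not in rules:  # anything not named in the table is a literal word
        return [pos + 1] if pos < len(words) and words[pos] == symbol else []
    ends = []
    for alternative in rules[symbol]:
        positions = [pos]
        for part in alternative:
            positions = [e for p in positions for e in match(part, words, p, rules)]
        ends.extend(positions)
    return ends

def is_phrase(symbol, text, rules=RULES):
    """True if the whole text can be analyzed as one instance of symbol."""
    words = text.split()
    return len(words) in match(symbol, words, 0, rules)

print(is_phrase("PP", "of sound"))            # True
print(is_phrase("PP", "of the sound waves"))  # True

# Extending the grammar touches only the data, not the program.
extended = dict(RULES, ART=[["the"], ["our"]], NOUN=RULES["NOUN"] + [["ears"]])
print(is_phrase("PP", "on our ears", extended))  # True
```

A specialized program for preposition plus noun would have had to be rewritten for each new phrase type; here a new phrase type is one more line in the table.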

One of our practical problems is to achieve

an understanding by outsiders of the use of computers

in processing linguistic material. Most of us had

our notions about scientific procedures determined by

elementary science class, in which we probably used a

Bunsen burner. This early activity seems to leave the


indelible impression that scientific equipment, for

example a computer, is like a Bunsen burner. There is

little variety of use for a Bunsen burner--it merely

heats things. The heat isn't different if one lights

it with a flint, or a match--if one strikes the match

on a piece of sandpaper or one's thumbnail. By anal-

ogy it is assumed that the machine is the essential

part of computation; after one switches on the power,

a computer can cook your data as well as mine. Yet

in language, as in the social sciences, the important

part of computation is the program. The importance

of how one utilizes a machine rather than the makeup

of the machine may be one of the essential differences

between work in the social sciences and that in the

natural sciences. Possibly software and hardware

sciences would be more appropriate names.


5 AUTOMATIC CLASSIFICATION

Yet even a system with programs of the com-

plexity of those illustrated is inadequate for handling

language. In a sentence of fewer than twenty elements,

for example, there are more than a million possibilities

of analysis. But this figure, large though it is,

fails to take into account an analysis for meaning, for

determining among other things that in our sentence

sound is similar in meaning to noise rather than to

healthy, valid, as in a sound mind or a sound theory.

When we handle the multitude of entities necessary in

analysis of meaning we will deal with many more

possibilities of interpretation than are found for on.

In managing these, our present computers would be choked.

Even the larger computers now becoming available would

deal with the quantities of data slowly. Adequate

speed seems possible only by refinements of computer

theory and in improved techniques of classification.
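The text does not say how the figure of a million was reached. One way to get a feeling for its order of magnitude, offered here purely as an assumption, is to count the ways a sequence of elements can be grouped into successive pairs, ignoring class labels altogether; these counts are the Catalan numbers.

```python
from math import comb

def bracketings(n_elements):
    """Number of distinct binary bracketings of a sequence of n elements,
    i.e. the Catalan number C(n - 1)."""
    k = n_elements - 1
    return comb(2 * k, k) // (k + 1)

for n in (10, 15, 19):
    print(n, bracketings(n))
```

On this count a sequence of fifteen elements already has 2,674,440 possible groupings, so a figure above a million for fewer than twenty elements is not surprising, even before alternative class assignments are considered.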

A few years ago R. M. Needham of Cambridge

University pointed the way to such classification

with his clumping theory. His procedures are being

expanded for application to larger sets of data by

A. G. and N. Dale of our Center. Details are provided

in the paper, A Programming System for Automatic Classification with Applications in Linguistic and Information Retrieval Research, LRC 64 WTM-40,

written by A. G. and N. Dale and E. D. Pendergraft.

With other papers, this is available from the Center.

Even the procedures described in this paper require

a great deal of computer time for handling a relatively

small number of entities. Further research is being

pursued to improve and speed up the procedure. I have


time merely to mention such research, but would also

like to point out that it was not even envisaged be-

fore language analysis with computers was undertaken.

The amount of linguistic data which must be manipulated,

as well as its complexity, has pointed up the need for

research in fields of applied logic or mathematics that

would not have been related to language investigation

a few years ago. Students in the sciences, for whom

the required language courses may seem to have little

lasting value, might well consider applying themselves

to these problems. Solutions will follow only from a

quantitative approach, generally lacking in previous

students of language.
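To suggest what automatic classification involves, a much simplified Python sketch follows. It is not the procedure of Needham or of the paper cited above; the similarity scores are invented, and the criterion used (an item joins the group to which its total similarity is greater) is an assumption adopted only for illustration.

```python
# Hypothetical symmetric similarity scores, invented for illustration.
SIMILARITY = {
    "noise":   {"tone": 3, "din": 2},
    "tone":    {"noise": 3, "din": 2},
    "din":     {"noise": 2, "tone": 2, "sane": 1},
    "healthy": {"valid": 3, "sane": 2},
    "valid":   {"healthy": 3, "sane": 2},
    "sane":    {"healthy": 2, "valid": 2, "din": 1},
}

def grow_clump(similarity, seed, max_rounds=100):
    """Grow a group from seed: move each item to whichever side
    (group or remainder) it is more similar to, until nothing moves."""
    clump = set(seed)
    for _ in range(max_rounds):
        moved = False
        for item in similarity:
            if item in seed:
                continue  # seed members anchor the group and are never moved
            links = similarity[item].items()
            inside = sum(w for other, w in links if other in clump)
            outside = sum(w for other, w in links if other not in clump)
            belongs = inside > outside
            if belongs != (item in clump):
                if belongs:
                    clump.add(item)
                else:
                    clump.discard(item)
                moved = True
        if not moved:
            break
    return clump

print(sorted(grow_clump(SIMILARITY, {"noise"})))  # ['din', 'noise', 'tone']
```

Every pass compares each item with every other, so the work grows rapidly with the number of entities; this is the practical difficulty that makes faster procedures a subject of research.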


6 ANALYSIS OF MEANING

But though we face numerous problems in the

development of computer systems and in the theoretical

work which must be carried out before systems and com-

puters can manage efficiently the huge and complex

amounts of data, our largest problems remain in the

understanding of language. Chief among these is the

treatment of meaning. In dealing with meaning we are

probably a bit farther along than Plato, though not

much. Our dictionaries largely sidestep the problem;

they set out to provide synonyms, whether monolingually

or bilingually. Since they are fairly effective

tools, we can handle translation of a sort without

understanding meaning. But for competent translation,

for automating indexing and abstracting, for problems

in artificial intelligence, we will have to control

meaning as we now do syntactic relationships.

Our theoretical approach is clear. We assume

that language is structured at the level of meaning

similarly to its structure at the levels of sound

and syntax. Again, we do not relate entities to the

outside world, but to concepts. Still the problems

of analysis are staggering. The sheer magnitude of

the data--all human knowledge--is troublesome enough.

But how to classify it? By specialties as we do in real

life? Should one computer handle nuclear physics,

another the physics of light, another molecular biology,

and so on? (If we did, we would not welcome a physicist

who also concerns himself with biology.) But if we

divide the universe of concepts in this way, what type

of hierarchical arrangement should we use? If, for


example, we define man as 'male human being', should

we distinguish between the concepts 'male' and 'human

being' because 'male' is automatically supplied in such

sequences as 'he was a man who ..., the king is a man

who ...'? It will be difficult to answer such questions

until we carry on a fair bit of investigation. Before

then, it will even be difficult to pose the proper

questions.
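The question about 'male' and 'human being' can at least be stated concretely. In the Python sketch below each entity is related to a set of concepts; the table and its entries are hypothetical and settle nothing about how concepts ought to be divided.

```python
# Hypothetical concept table: each entity is related to a set of concepts.
CONCEPTS = {
    "man": {"male", "human being"},
    "woman": {"female", "human being"},
    "king": {"male", "human being", "ruler"},
    "he": {"male"},
}

def new_information(subject, predicate, concepts=CONCEPTS):
    """Concepts the predicate adds beyond those the subject already supplies."""
    return concepts[predicate] - concepts[subject]

print(new_information("he", "man"))    # {'human being'}
print(new_information("king", "man"))  # set()
```

In he was a man the concept 'male' is already supplied by he, and in the king is a man nothing new is added at all, which is the redundancy the question above turns on.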


7 ACCOMPLISHMENTS

It may be disappointing for non-linguists

to hear that linguistic work has scarcely begun, with

or without computers. But we have some accomplishments.

Some theoretical positions seem supportable. We are

on our way to an extensive and flexible linguistic

research system, and expect to have adequate computers

to make use of it. The traditionally lone linguist

is beginning to work with specialists in related fields.

Even the achievement of analyzing language syntactically

may seem small. But our tools are still inadequate.

Given satisfactory scanning devices and more powerful

computers we will be able to use our system for ana-

lyzing more than a snatch of language. Already straight-

forward linguistic applications may be carried out,

if adequate resources are provided; any book may be auto-

matically indexed, and accordingly among other things

more readily proof-read. Bibliographical and other

data may be managed automatically; in a pilot project,

the Center has listed all Slavic books in the University

Library, so that anyone interested in Tolstoy, in

Russian novels or the like, may be given an immediate

print-out of the titles. Other such projects need only

financial support for achievement. The chief aim of

the Center, however, is to continue theoretical investi-

gations of language and data processing techniques, and

the preparation of computer programs, so that ultimately

a computer will be able to manipulate language with

somewhat the same proficiency as does man.
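The automatic indexing mentioned above can be pictured with a minimal sketch. It assumes the book is already available as machine-readable text, one string per page; the function name `build_index` and the sample pages are hypothetical and are not part of the Center's system.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each word to the sorted list of page numbers on which it occurs."""
    index = defaultdict(set)
    # Pages are numbered from 1, as in a printed book.
    for page_number, text in enumerate(pages, start=1):
        # Assumption: a "word" is any unbroken run of letters, case ignored.
        for word in re.findall(r"[A-Za-z]+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in sorted(index.items())}


# Hypothetical two-page sample.
sample_pages = ["Tolstoy wrote novels.", "Russian novels and Tolstoy."]
for word, page_numbers in build_index(sample_pages).items():
    print(word, page_numbers)
```

Because every distinct word form is listed once, a misspelled form shows up as an entry of its own, which is one way an index can help the proof-reader.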

INPUT SENTENCE TO ANALYSIS PROGRAM

[Display: SCIENCE CONFERENCE CORPUS, CORPUS DISPLAY. University of Texas Science Conference, 20 November 1964.]

THIS IS A SENTENCE ANALYZED BY THE LINGUISTICS RESEARCH SYSTEM.

(See accompanying displays.)

2. MATRIX IDENTIFYING CHARACTER POSITIONS IN INPUT SENTENCE

[Display: SCIENCE CONFERENCE CORPUS DISPLAY, PAGE 1, 20 November 1964. The input sentence is set out against position markers running from 10 to 100 in steps of ten.]

Sentence begins in col. 2 and ends in col. 64.
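The column counts in this caption can be checked with a short sketch. It assumes that columns are numbered from 1, that the words are separated by single blanks, and that the final period occupies a column of its own; the helper name `word_columns` is hypothetical.

```python
# Input sentence as given in the first display.
SENTENCE = "THIS IS A SENTENCE ANALYZED BY THE LINGUISTICS RESEARCH SYSTEM."
START_COLUMN = 2  # the caption states that the sentence begins in col. 2


def word_columns(sentence, start_column):
    """Return (word, first_column, last_column) for each blank-separated word."""
    spans = []
    column = start_column
    for word in sentence.split(" "):
        spans.append((word, column, column + len(word) - 1))
        # Step over the word and the single blank assumed to follow it.
        column += len(word) + 1
    return spans


for word, first, last in word_columns(SENTENCE, START_COLUMN):
    print(f"{first:>2}-{last:<2} {word}")
```

Under these assumptions the last item, SYSTEM with its period, ends in column 64, in agreement with the caption, and SENTENCE occupies columns 12 to 19, a pair of numbers that is also legible in the analysis printout.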

3. GRAMMAR USED IN AUTOMATIC ANALYSIS

SCIENCE CONFERENCE GRAMMAR, 20 NOVEMBER 1964

SYNTACTIC DATA, STRATUM 2, FORM SORT

FORM    P           DESIGNATUM
67      1.000000    DTRMNR    * THIS
72      1.000000    N5A       * SEN * TENCE
75      1.000000    V4C       * ANALYZ
77      1.000000    PRPSTN    * BY
79      1.000000    DTRMNR    * THE
82      1.000000    N3A       * LIN * GUIS * TICS
84      1.000000    N5H       * RE * SEARCH
85      1.000000    N5A       * SYS * TEM

Symbols in DESIGNATUM column are class names of construction lying to right of

symbol. Each entry is separate rule identified uniquely by number in FORM

column.
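One way to picture how such entries might be consulted is a simple table look-up. The sketch below is an illustration only and is not the Center's program: the pairing of FORM numbers with class names is read off the display in order of appearance and should be treated as an assumption, the constructions are written without their internal asterisks, and the name `classify` is hypothetical.

```python
# Stratum 2 entries as read from the display (pairing assumed):
# FORM number -> (class name from the DESIGNATUM column, construction).
STRATUM_2_RULES = {
    67: ("DTRMNR", "THIS"),
    72: ("N5A", "SENTENCE"),
    75: ("V4C", "ANALYZ"),
    77: ("PRPSTN", "BY"),
    79: ("DTRMNR", "THE"),
    82: ("N3A", "LINGUISTICS"),
    84: ("N5H", "RESEARCH"),
    85: ("N5A", "SYSTEM"),
}


def classify(form):
    """Return every (rule number, class name) whose construction equals the form."""
    return [
        (number, class_name)
        for number, (class_name, construction) in sorted(STRATUM_2_RULES.items())
        if construction == form
    ]


print(classify("SENTENCE"))  # the single rule recorded for this form
print(classify("IS"))        # empty: IS is entered in the stratum 3 display
```

Keeping the rule number alongside the class name mirrors the statement that each entry is a separate rule identified uniquely by its FORM number.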

SCIENCE CONFERENCE GRAMMAR, 20 NOVEMBER 1964

SYNTACTIC DATA, STRATUM 3, FORM SORT (PAGES 1 AND 2)

[Table with columns NOTES, FORM and DESIGNATUM. Rules 65, 66, 68, 69, 70, 71, 73, 74 and 76 appear on page 1, and rules 78, 80, 81 and 83 on page 2. Class names legible in the DESIGNATUM column: SNTNC, CLS, DTRMNR, BE SNGLR PRSNT, NP SNGLR, NMNL, VRBL -D2 PHRS, VRBL -D2AN, PRPSTN PHRS, PRPSTN, N5A, V4C, N3A and N5H. The constructions * IS, * A and * ED are entered in this stratum.]

[ANALYSIS display, CORPUS 01, SAMPLE 001, PAGES 2 AND 3. Columns: FROM, TO, PROBABILITY and NOTES. The entries pair FROM and TO character positions with a probability and a list of rule numbers. Legible entries include positions 12 to 19 with rule 72 (SENTENCE), 21 to 26 with rule 75 (ANALYZ), 21 to 28 with rules 74 and 75, 30 to 31 with rule 77 (BY), 33 to 35 with rule 79 (THE), and 37 to 47 with rule 82 (LINGUISTICS). The longest entries run from position 10, 12, 21, 30, 33 or 37 to position 56 or 63 and list rules numbered between 69 and 85.]

5. PARSING DIAGRAM

Reconstructed from ANALYSIS display to illustrate analysis provided by computer.

[Tree diagram over the words THIS IS A SENTENCE ANALYZED BY THE LINGUISTICS RESEARCH SYSTEM. Legible node labels: CLS, DTRMNR, BE SNGLR PRSNT, NP SNGLR, NMNL, VRBL -D2 PHRS, VRBL -D2AN, PRPSTN PHRS, PRPSTN, V4C, N3A, N5H and N5A. Rule numbers 66, 67, 70, 71, 72, 73, 77, 78 and 79 are legible beside the nodes.]

THE UNIVERSITY OF TEXAS LINGUISTICS RESEARCH SYSTEM

Circles represent magnetic data tapes. Boxes represent programs; those with heavy lines are scheduled for completion by the end of this year.

[Flow chart. Labels:

Corpus: INPUT CORPUS, CORPUS MAINTENANCE, CORPUS REVISION, CORPUS DISPLAY, CORPUS SELECTION, OUTPUT CORPUS.

Recognition: MONOLINGUAL RECOGNITION, INTERLINGUAL RECOGNITION, INPUT DISTRIBUTION, LEXICAL ANALYSIS, SYNTACTIC ANALYSIS, SEMANTIC ANALYSIS, LEXICAL ANALYSIS & CHOICE, SYNTACTIC ANALYSIS & CHOICE, SEMANTIC ANALYSIS & CHOICE, LEXICAL ANALYSIS DISPLAY, SYNTACTIC ANALYSIS DISPLAY, SEMANTIC ANALYSIS DISPLAY, LEXICAL & SYNTACTIC ANALYSIS DISPLAY, LEXICAL, SYNTACTIC & SEMANTIC ANALYSIS DISPLAY.

Transfer: TRANSFER MAINTENANCE, TRANSFER, INPUT TRANSFER, OUTPUT TRANSFER, INTERLINGUA, SUBSTITUTION, DISTRIBUTION, MONOLINGUAL TRANSFER REVISION, MONOLINGUAL TRANSFER DISPLAY, INTERLINGUAL TRANSFER REVISION, INTERLINGUAL TRANSFER DISPLAY, MONOLINGUAL INPUT TRANSFER SELECTION, INTERLINGUAL TRANSFER SELECTION, MONOLINGUAL OUTPUT TRANSFER SELECTION.

Production: INTERLINGUAL PRODUCTION, MONOLINGUAL PRODUCTION, OUTPUT DISTRIBUTION, LEXICAL SYNTHESIS, SEMANTIC SYNTHESIS, LEXICAL CHOICE & SYNTHESIS, SYNTACTIC CHOICE & SYNTHESIS, SEMANTIC CHOICE & SYNTHESIS.

Grammar: GRAMMAR MAINTENANCE, GRAMMAR, OUTPUT GRAMMAR, INPUT GRAMMAR SELECTION, OUTPUT GRAMMAR SELECTION, GRAMMAR DISPLAY, RULE REVISION, PROBABILITY REVISION.]