
Theoretical Constructs and Measurement of Performance and Intelligence in Intelligent Systems

Larry H. Reeker
National Institute of Standards and Technology

Gaithersburg, MD 20899 ([email protected])

Abstract

This paper makes a distinction between measurement at surface and deeper levels. At the deep levels, the items measured are theoretical constructs or their attributes in scientific theories. The contention of the paper is that measurement at deeper levels gives predictions of behavior at the surface level of artifacts, rather than just comparison between the performance of artifacts, and that this predictive power is needed to develop artificial intelligence. Many theoretical constructs will overlap those in cognitive science and others will overlap ones used in different areas of computer science. Examples of other "sciences of the artificial" are given, along with several examples of where measurable constructs for intelligent systems are needed and proposals for some constructs.

Introduction

There are a number of apparent ways and certainly many more not so apparent ways to measure aspects of performance of an intelligent system. There are a variety of things to measure and metrics for doing so being proposed at this workshop, and it is important to discuss them. A measure of machine intelligence that is supposed to correlate with the system's future performance capability on a larger class of tasks considered intelligent would be analogous to human IQ. Developing one would require agreement on one or more definitions of machine intelligence and finding a set of performance tasks that can predict the abilities required by the definition(s), and it still might not say much about the nature of machine intelligence or how to improve it.

One reason that metrics of performance (and perhaps, of intelligence) are needed is that they directly address the fact that it has been difficult to compare intelligent systems with one another, or to verify claims that are made for their behaviors. Another reason is that having measurements of qualities of any sort of entity provides a concrete, operational way to define the entity, grounding it in more than words alone. All of these aspects - comparability, verifiability, and operational grounding - were undoubtedly at least part of what Lord Kelvin meant about measurements providing a feeling that one understood a concept in science. (See the preamble to this workshop [Meystel et al 00]: "When you can measure what you are speaking about and express it in numbers, you know something about it.")

The measurements that form the primary topic of this paper are of a different type. They are ones that look ahead to the future, when the intelligent systems or artificial intelligence[1] field is more mature. The notion of a mature field is defined here in terms of scientific theories that predict the performance of the systems on the basis of the underlying science. It is suggested that really valuable measurements require reliable predictions of this scientific sort, rather than just ways to compare the technological artifacts based on the science. To do this, it is necessary to develop theories containing measurable theoretical constructs, as will be discussed below.

The discussion of metrics for attributes of theoretical constructs herein does not conflict in any way with the idea of overall system measurements, comparisons, or benchmarks, which are useful for the purposes mentioned above. In fact, it is a philosophical problem to decide where theoretical constructs stop and empirical constructs begin. Measurements of artifacts will be referred to as surface measurements, those of a more theoretical nature as deep measurements, terms borrowed from Noam Chomsky's [65] terms for levels of syntactic description. The question of "how deep" can be left open at this time. This paper advocates looking for measurable theoretical constructs at the deeper level that will predict surface behaviors at the level of the system or subsystem, or of an entire artifact.

[1] The latter term will be used herein because the shortened form, "AI", is more common than "IS".

The remainder of the paper explains the form that we will expect for AI theories in the future if they are to qualify as scientific theories and suggests theoretical constructs that may have measurable properties. It will discuss existing constructs that are developing as candidates for deep metrics and how they may relate to surface measurement. It will compare them to constructs in existing scientific theories at deep and surface levels. It will suggest that they will naturally relate to constructs from the artificial and natural sciences, specifically from cognitive science and computer science.

Computation Centered and Cognition Centered Approaches to AI

At all levels, from surface to deep, the constructs to be measured may depend on the approach taken to AI. There are two distinguishable approaches that have been taken over the years, which we will call "computation centered" and "cognition centered".[2] The computation centered approach focuses on how certain tasks can be accomplished by artificial systems, without any reference to how humans might do similar tasks. We do not usually think of numerical calculation as AI, but if we did, we would have to think of the way it is done as computation centered. There is no particular reason to make it cognition centered.

In the cognition centered approach to AI, the tradition is to discover human ways of doing cognitive tasks and see how these might be done by intelligent systems. Sometimes the motivation for this approach has been to try to find plausible models for human cognitive processes (cognitive simulation), but for AI purposes, it has often been a matter of using human clues to try to accomplish the goals of the computation centered approach. Some researchers feel that developing the artifacts using cognitive ideas may lead to more robust AI systems (using "robust" in the sense that the system is not narrow or "brittle" in its intelligent capabilities). But it is also a natural way to think about developing AI capabilities, since not all areas related to intelligent activities have been explored and reduced to mathematical methods to the extent of numerical calculations, or even of mathematical logic, which might directly facilitate a computation centered approach.

[2] In the email exchange leading up to the Workshop, Robby Garner distinguished a third approach from the two mentioned: "Mimetic Synthesis", whose prime concern is the "Turing test" one of representing a computer to a human user as if it were another human. It is a good distinction, though like the others, the boundaries are not always clear.

Mathematical logic makes an interesting case for pointing out that most AI researchers in practice blend the computation centered and cognition centered approaches, since it is formalized, yet still can be approached in a cognition centered way. Computers actually implement mathematical logic, which is essential in the control statements of programming languages. However, actually proving theorems in logic (beyond propositional logic, where truth-table methods can be used) is a creative intelligent activity. There, things become more complex, in different ways. The first complexity is that it is a creative activity, and we do not really understand even how people do it. Secondly, it is informationally complex: there are inherent undecidability problems in logics of sufficient richness for most interesting purposes.

In attempts to make it easier for humans to prove theorems, natural deduction methods were invented by Gentzen [34] and developed by a number of people, notably Fitch [52]. In a sense, natural deduction can be thought of as a computation-oriented version of theorem proving, taking away some of the mental work of creativity. But this does not change the inherent informational complexity problems, which provide inherent limits on computability.

Going beyond logic to general problem solving, one finds some empirical studies of effective ways in which humans do it that antedate the computer. One of them, means-ends analysis, was codified in the General Problem Solver (GPS) program of Newell and Simon [63] (see also [Ernst and Newell 69]). For programs in the GPS era, it was in the spirit of that work to attempt measurement of the extent to which the program could mimic human behavior. This was done by also studying verbal protocols of people solving the problem. Any way of comparing those to the performance of the program was still pretty much a surface measurement. Such surface measures of cognitive performance are also the heart of the Turing test [Turing 50], but do not tell us much about what is happening deeper in the system, as Joseph Weizenbaum showed with Eliza [66] (and emphasized in an ironic letter [74]). In more recent times, case-based methods have been advocated [Kolodner 88] as relating to the way some people solve problems, and they do look very promising. Some of the constructs from these problem-solving methods will be mentioned below.

Though the computation centered and cognition centered approaches blend well, the measurements that occur to the developers in the two approaches will naturally differ, and this is particularly true as one tries to go to a deeper level by using constructs that are based either on cognition or on computation. In other words, AI may have measurable constructs coming from at least two different sources, the computation side and the cognitive side. This fact has some interesting implications as one looks at the measurement of deeper constructs, which may have to be reconciled with both approaches to be meaningful.

The Structure of Scientific Theories

Today's views of scientific theory have changed from those held in the 19th Century, Lord Kelvin's time. The bare-bones version of a scientific theory today is that it consists of a model composed of abstract theoretical constructs and a calculus that manipulates these constructs in a way that can account for observations and accurately predict the results of experiments. The model is as central today as was the notion of measurement to Kelvin. The theoretical constructs have a relation with observed entities, properties and processes that may be quite abstract, not necessarily readily available to human senses, but following directly from calculations based on the theory. There are a number of principles applied to a model that give us increased confidence in the theory, but the one most relevant here is that we can measure the observed entities to confirm the predictions of the theories. So Kelvin's concern has been preserved, but augmented, in today's view of theories.

It is relevant to observe that the "calculus" mentioned above is used in the dictionary sense: "a method of computation or calculation in a special notation (as of logic or symbolic logic)". That means that it may be numerical or non-numerical. In fact, as Herb Simon and Allen Newell [65] pointed out, there is no reason that the calculus cannot be expressed in the notation of a computer program, the better to speed its manipulation of the theoretical constructs.

For scientific theories in AI to be respectable, there will be certain requirements on their theoretical constructs, and these affect whether the constructs are accepted and whether the theories in which they occur are accepted. The late Henry Margenau gave a pragmatic treatment of these requirements in his book The Nature of Physical Reality [Margenau 50]. A working physicist as well as a philosopher, Margenau stressed that no amount of empirical evidence was scientifically convincing by itself, since it did not specify a unique model; and he also stressed the need for the binding of theoretical constructs to one another in a "fabric". This fabric was made up of theory and of mappings to empirical data. The theory was convincing to the degree that certain criteria were met - not a "black and white" situation, but one of degree. One of the criteria was the extent to which the models and constructs were extensible to larger and larger areas of scientific endeavor. As the fabric of the theory became larger and stronger, it became more difficult to rip it asunder.

Perhaps our emphasis on finding metrics can solidify the theoretical constructs of the field, as well as providing a means of measuring progress. The key to doing this is not to think of evaluation only as measurement of some benchmarks or physical parameters ("behaviors") that are manifested in the operation of the systems being evaluated. We need to be thinking in terms of the inner workings of the systems and how the parameters within them relate to the measured externally manifested behaviors.

One of Lord Kelvin's special interests was temperature. Temperature is of course something that we experience, something not wholly abstract. Certain physical properties are related to temperature, and the most easily observed is the freezing and boiling of water. It took some scientific discovery to realize that each of these phenomena always takes place at a particular temperature (with a few reservations, like altitude and purity of the water), but still, those are concrete embodiments. Temperature had been a subjective attribute during most of the history of mankind, but the scientific notion of temperature is a theoretical construct, even though it has a close correspondence to subjective experience. The particular metrics chosen related to water boiling (in both Fahrenheit and Celsius), to freezing (in Celsius), and to the "coldest" temperature that could be achieved with water, ice and salt (in Fahrenheit). Lord Kelvin also took the amazing step of developing a notion of temperature that is really abstract. His zero point of minus 273.15 degrees Celsius has never quite been reached, and is far below what any person could experience. Yet it is very real as a scientific construct, one that is part of the fabric of physical science and ties various aspects of science together in that fabric.

Many other common terms in physical theory, like mass and gravity, are theoretical constructs, though they are related to human senses. Only in relatively recent physics history have mass and gravity been understood, and we owe that understanding to bits of inspiration on the part of Galileo and Newton. Having only half a century of AI history to look back on, we cannot really expect to have such a firm fabric of theoretical constructs stitched together. But some ideas are given below, after a comparison of the sciences that study natural systems and those that study artificial ones.

Sciences of the Artificial and Their Relation to Natural Sciences

Herbert Simon came to the conclusion that there was a place for what he called "sciences of the artificial" in his important book [69]. He did not invent the study of artifacts in a systematic manner, but he realized accurately and acutely that artifacts could be subjects of "real sciences", with deep theories of the sort that exist in the natural sciences. We will now consider some of the implications of this idea.

The boundaries between sciences of the artificial and the natural sciences are not clear-cut in practice, because nature colors human artifacts, determining their possibility and their features. The "engineering sciences", the portions of engineering that have been formalized in the sense that they can predict the behavior of artifacts, including aspects such as stability and strength, can be considered sciences of the artificial. The reason that this is not remarked upon more often is that they have called upon the physical sciences more and more over the centuries to aid the "ingenuity" that gives the profession its name.

Linguistics is a science of the artificial. Human language is the artifact that it studies. But of course, the properties of the artifact are shaped by the natural properties of human learning and cognition, human hearing and speech in many ways. In the domain of phonetics, for example, David Stampe's "natural phonology" [Stampe 73, Donegan and Stampe 79] characterizes some of the interactions between language as an artifact and as a natural phenomenon. We do not understand even yet the extent of the interaction between linguistics and human cognition. Is there an LAD (language acquisition device) [Chomsky 75] innate in humans that is specific to language, or is the learning of language based on the same principles as such other acquired systems as visual perception? Nobody knows for sure; but whatever the case, the nature of the world and the nature of learning processes must affect language.

Computer Science is a science of the artificial. Certainly, this is true insofar as it studies computers, which are artifacts; but also to the extent that it studies algorithms, which are human creations, too. The main subject studied in much of Computer Science is not computers but information, and the "state", which is all the relevant information about a system at a given time, is therefore a fundamental theoretical construct. Information is a theoretical construct that is also fundamental in the natural sciences, but whose significance as a theoretical construct has only become apparent in this century, as its relationship to entropy and its role in quantum theory have been realized. So again, Computer Science has both artificial and natural parts.

Economics, another science of the artificial, studies a major artifact, the economy. Looking at this science of the artificial can provide some insight into the position of AI as a science of the artificial, and into the role of measurable theoretical constructs.

Predictive Measurement in a Science of the Artificial - An Example from Economics

Economics has struggled for longer than AI has existed to find theoretical constructs that have predictive power. Economics deals with large amounts of aggregated data, so its empirical data are statistical in nature, and its models are not as clear as physical models with respect to the interrelationships among theoretical constructs, nor are they as widely accepted. Yet they do allow some prediction of economic performance and are used in control processes for the purposes of economic stability, with a degree of success.

As this paper is being written, the U.S. Federal Reserve Bank is aggressively raising interest rates because the employment rate (inferred from job creation and unemployment data) is high and economic growth (a function of GDP change and other data) has been rapid. In their models, this predicts increasing inflation (as measured by the consumer and producer price indices and other constructs). It has recently been conjectured that there should be a role in these models for productivity, the role of which is not yet fully understood. So economic theory, as it develops, must relate all of these constructs and others: average interest rates, supply and demand for money and goods, savings rate, etc.

Economic theories and their constructs are still complex and incomplete. Incorporated in complex computer models, their predictions are not totally trustworthy, but the predictions are testable. Economics provides an example, from another science of the artificial, that AI should follow in formulating and measuring constructs.

Surface Measures and Theoretical Constructs in AI - Some Examples

The sort of predictive ability that economists want, we would like to see in AI, too. If we have theoretical constructs at some deeper level, we can also use the theories of which they are a part to simulate or predict mathematically what happens if we increase or decrease parameters related to those constructs. It is a thesis of this paper that there are theoretical constructs that can predict system performance measured in terms of surface measures. At this point in the development of AI as a science, it is hard to say just exactly what they would be, but some ideas can be drawn from today's AI and related subjects.

An Example Construct: Robustness

A surface measurement that could be very valuable across a variety of systems is some measure of robustness - the ability to exercise intelligent behavior over a large number of tasks and situations. From a computation centered standpoint, if systems become robust, AI progress would be easier to see. From a cognition centered standpoint, a system can never really be intelligent if it is not robust. (One way to think of a measure of intelligence in a single system would be as a measure of performance, robustness and autonomy.) The surface way to determine the robustness of a system would be to try it on a number of tasks and see how broad its methods are, as in the sketch below. But what makes intelligent systems robust? Learning ability, experience, and the ability to transfer that experience to new situations are all things that come to mind. A rough sketch of how measuring theoretical constructs in those areas might give us a predictive figure for developing robust systems is given in the sections that follow.
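
The surface procedure just described can be rendered as a small computation. The following is a minimal sketch in Python; the task names, the [0, 1] performance scale, and the 0.5 threshold are assumptions made only for exposition, not proposals from the literature:

```python
from statistics import mean

def surface_robustness(task_scores, threshold=0.5):
    """Surface robustness: how broadly a system performs acceptably
    across a battery of tasks.

    task_scores: dict mapping task name -> performance score in [0, 1].
    Returns the fraction of tasks meeting the threshold ("breadth")
    and the mean performance over the battery.
    """
    if not task_scores:
        raise ValueError("need at least one task")
    breadth = sum(s >= threshold for s in task_scores.values()) / len(task_scores)
    return {"breadth": breadth, "mean_performance": mean(task_scores.values())}

# A system that is strong on two tasks but brittle on a third.
print(surface_robustness({"navigation": 0.9, "dialogue": 0.7, "planning": 0.2}))
```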

Robustness: Learning?

If learning can make systems more robust, it should be interesting to measure the strength of the system's learning component. How easily does it adapt the system to a new situation? Unsupervised learning has wide applicability, but it can basically only determine clusters of similar items. Supervised learning must be presented with exemplars to learn relations, which seems not to be enough for a machine to extend its own capabilities. Reinforcement learning (RL) is a blend of both cognition centered and computation centered AI. It started out as a model of classical conditioning, but turned out to be applied dynamic programming. There are a number of different techniques within RL, all of which have many possible applications. Neural nets or other approaches may be used. The theoretical constructs include the state space chosen, the reinforcement function, and the policy. The field is becoming quite sophisticated, and there are known facts about the relation of these to outcomes in particular cases [Mahadevan and Kaelbling 96]. Suppose that a reinforcement learning system constitutes a part of the intelligence of an intelligent system. There should be some way of predicting how that system would do upon encountering problems of a certain nature. By knowing how it chooses the concepts in its system and how they behave on problems of that type, one can provide a partial evaluation of how effective the learning system would be. By obtaining such figures for all such subsystems, one could relate them to the performance of the full intelligent system. There is much work to be done in that direction.
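
To make those three constructs concrete, here is a minimal tabular Q-learning sketch. The toy "walk to the end of a chain" task, the learning constants, and all identifiers are illustrative assumptions rather than a description of any particular system, but the state space, the reinforcement function, and the policy each appear as an explicit object whose attributes could in principle be measured or varied:

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning on a toy "walk to the end of a chain" task.
# The three theoretical constructs named above are explicit objects:
# the state space (states 0..N_STATES-1), the reinforcement function
# (reward), and the policy (epsilon-greedy over the table Q).
N_STATES, ACTIONS = 5, (-1, +1)
Q = defaultdict(float)                      # Q[(state, action)] -> value

def reward(state):                          # reinforcement function
    return 1.0 if state == N_STATES - 1 else 0.0

def policy(state, epsilon=0.2):             # epsilon-greedy policy
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    vals = [Q[(state, a)] for a in ACTIONS]
    best = max(vals)
    return random.choice([a for a, v in zip(ACTIONS, vals) if v == best])

alpha, gamma = 0.5, 0.9                     # learning rate, discount
for _ in range(500):                        # training episodes
    s = 0
    while s != N_STATES - 1:
        a = policy(s)
        s2 = min(max(s + a, 0), N_STATES - 1)
        target = reward(s2) + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# The learned greedy action in each state (should be +1 throughout).
print([max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)])
```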

Under certain circumstances, one can imagine learning extending robustness; but having to learn each new variation of a problem, even by reinforcement, is unlikely to lead to robustness quickly. It is expected that reinforced behaviors learned in one situation might be identical to those needed in another situation, so this may lead to more rapid or better learning in the second situation. One approach to this is to condition behaviors that are not built into the system initially, as explored by Touretzky and Saksida [97]. But, still, one would like to have more general ways of reusing "big pieces" of learned knowledge.

Robustness: Transfer of Learning?

Transfer of learning is a phenomenon that we may be able to abstract into theoretical constructs that can help to predict robustness. It is still not a deep measure, so it will then be important to predict transfer of learning from deeper constructs, which will be mentioned below. At present, it is a research challenge to build transfer of learning into systems. But it is possible to see how one could test for it.

As far as measurement goes, here is roughly how transfer of learning might be measured:

1. Machine performance is measured on Task 1. The score is P(t1, T1) = performance at time t1 on Task 1, where P is some suitably broad performance measure.

2. Performance is measured on Task 2 without learning (this being an artifact where we can control learning) to obtain P(t1, T2) (keeping the time variable the same, because the same machine abilities are assumed without learning, even if the measurements are not simultaneous).

3. Note that if the measure is to have meaning, previous training that might affect T1 or T2 must be controlled for, which could be difficult.

4. The machine is now allowed to perform Task 1, in which it learns, achieving better performance at some time t2, i.e. P(t2, T1) > P(t1, T1).

5. It is then tested on T2, and the question is whether P(t2, T2) > P(t1, T2) without having done additional learning on Task 2.

If indeed P(t2, T2) > P(t1, T2) in some quantifiable way, the system has achieved (at least locally) one of the goals of AI, the transfer of learning from T1 to T2. The amount of transfer can be measured by the amount of improvement on Task 2 as a function of the amount of training on Task 1. Let us assume that we can describe this by some transfer effectiveness function E for the system being tested. Let us say E(T1, T2, t) gives "the effectiveness of training on T1 for time t in terms of transfer to T2". We could describe this by a graph of performance on T2 as a function of the time spent on T1.
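
The bookkeeping that this protocol implies is simple. In the minimal sketch below, every number is invented purely for illustration; the resulting curve is the E(T1, T2, t) function just described:

```python
def transfer_effectiveness(p_before, p_after):
    """E(T1, T2, t): gain on Task 2 attributable to training on Task 1
    alone, i.e. P(t2, T2) - P(t1, T2), with no direct learning on T2."""
    return p_after - p_before

# Hypothetical measurements, invented purely for illustration:
# P(t1, T2) = 0.40 before any training on T1; after 1, 2 and 4 hours
# of training on T1 (and none on T2), performance on T2 is re-measured.
p_t1_T2 = 0.40
p_t2_T2_by_hours = {1: 0.46, 2: 0.55, 4: 0.61}

curve = {t: transfer_effectiveness(p_t1_T2, p)
         for t, p in p_t2_T2_by_hours.items()}
print(curve)   # the graph of E(T1, T2, t) described in the text
```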

Developing such a measure of transfer of learning and getting it accepted is not simple. To be useful, we would need a way of comparing T1 and T2, to be sure that the second task is not just a subtask of the first. Difficult or not, defined measurements such as these are steps toward understanding the construct "transfer of learning" and achieving it in artifacts. The measurable transfer construct would, in turn, help to provide a measurement of robustness, since learning transfer can make a system more robust. It is a step toward measurement of intelligence, at least by some definitions of intelligence, and, intuitively at least, would have some predictive power.

How might we go about defining the similarity of T1 and T2, as suggested above? We would have to decide what we mean by similarity of task. An interesting essay in this area is "Ontology of Tasks and Methods" [Chandrasekaran, Josephson and Benjamins 98].

Various candidates for potentially measurable constructs that could be used to produce transfer, but also to relate transfer to other phenomena, are mentioned in a book edited by Thrun and Pratt [98], who have both had a research interest in learning-transfer processes. From the computation side comes the possibility of changing inductive bias. From the cognition centered side, there is generalization from things already learned; but overgeneralization can be a major problem in learning, so it needs to be constrained. (Some simple constraints on overgeneralization in language learning are discussed in [Reeker 76].)

Robustness: Case-Based Reasoning?

Case-based reasoning is an intuitively appealing technique that was mentioned earlier in this paper. The idea is that one learns an expanding set of cases and stores the essentials of them away according to their conventional features. They are then retrieved when a similar case arises and mapped into the current case. Potential theoretical constructs include indexing and retrieval methods for the cases, case evaluation, and case adaptation to the new situation. The cases could also be abstracted and generalized, to various degrees, into a model.
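
A minimal sketch shows where these constructs would live as separable, potentially measurable components. The feature scheme, the similarity metric, and the two toy cases below are assumptions invented for illustration:

```python
# Minimal case-based reasoning skeleton. Each component named above
# (indexing by features, a retrieval metric, an adaptation step) is a
# separable object whose contribution could be measured.
case_base = [
    {"features": {"domain": "travel", "budget": "low"},  "solution": "take the bus"},
    {"features": {"domain": "travel", "budget": "high"}, "solution": "fly direct"},
]

def similarity(f1, f2):
    """Retrieval metric: fraction of feature values that match."""
    keys = set(f1) | set(f2)
    return sum(f1.get(k) == f2.get(k) for k in keys) / len(keys)

def solve(problem_features):
    # Retrieval: the stored case most similar to the new problem.
    best = max(case_base, key=lambda c: similarity(c["features"], problem_features))
    # Adaptation: trivially annotated here; a real system would map the
    # retrieved solution onto the features that differ in the new case.
    return f"{best['solution']} (adapted from case {best['features']})"

print(solve({"domain": "travel", "budget": "low"}))
```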

Case-based reasoning is important for cognition centered AI. It is intuitively the way many people often figure out how to do things, and is thus embodied in the teaching methods of many professional fields - law, business, medicine, etc. It provides a launching pad for creativity as well, as mappings take place from one case to an entirely new one. Perhaps the new case is not really concrete, but a vague new idea. Then the mapping of an old case to it may result in a creative act - what we usually call analogy. Analogy, metaphor in language, is a rich source - absolutely ubiquitous - of new meanings for words, and thus of new ways to describe concepts, objects, actions. Perhaps one key to robustness is the ability to use analogy. Four interesting papers by researchers in the area can be found in an issue of American Psychologist [Gentner et al 97].

Existing Surface and Subsurface Performance Measures

Researchers in text-based information retrieval (IR) have traditionally considered themselves not to be a part of the AI field, and some have even considered artificial intelligence a rival technology to theirs; but there is an overlap of interest. It is worth noting that IR has had a useful surface measure of system performance that has guided research and allowed comparison of technologies. The measure consists of two numbers, recall and precision [Salton 71]. Recall measures the completeness of the retrieval process (the percentage of the relevant documents retrieved). Precision measures the purity of the retrieval (the percentage of retrieved documents judged relevant by the people making the queries). If both numbers were 100%, all relevant documents in a collection would be retrieved and none of the irrelevant ones. Generally, techniques that increase one of the measures decrease the other. Real progress in the general case is achieved if one can be increased without decreasing the other.
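
For a single query, the two numbers can be computed as follows (a minimal sketch; the document-id sets in the example are invented):

```python
def recall_precision(retrieved, relevant):
    """Recall and precision for a single query.

    retrieved: set of document ids returned by the system.
    relevant:  set of document ids judged relevant by the user.
    """
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 1.0
    precision = hits / len(retrieved) if retrieved else 1.0
    return recall, precision

# Three of four relevant documents retrieved, with two false alarms.
print(recall_precision({1, 2, 3, 9, 10}, {1, 2, 3, 4}))   # (0.75, 0.6)
```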

For the IR community, better recall and precision numbers have both shown the progress of the field. They also show that it is still falling short, keeping up the challenge, especially as the need to use it for very large information corpora rises. In addition, they provide a standard within the community for judging various alternative schemes. Given a particular text corpus, one can consider various weighting schemes, use of a thesaurus, use of grammatical parsing that seeks to label the corpus as to parts of speech, etc., to improve the retrieval process. The interesting thing is to relate these methods and the characteristics of the corpus to precision and recall, but so far those relationships have not been sharp enough to quantify generally.

Related to information retrieval is automated natural language information extraction, which tries to find specified types of information in bodies of text (often to create formatted databases where extracted information can be retrieved or mined more readily). A related but different (cost-based) measure was defined several years ago for a successful information extraction project [Reeker, Zamora and Blower 83]. One measure was robustness (over the texts, not over different tasks, as in the broader intelligent systems usage discussed earlier). This was defined as the percentage of documents out of a large collection that could be handled automatically. The idea was that some documents would be eliminated through automated pre-screening (because those documents were not described by the discourse model the system used) and relegated to human processing. Another measure was accuracy (the percentage of documents not eliminated that were then correctly processed in their entirety by the system). Yet another was error rate (the percentage of information items that were erroneous - including omitted - in incorrectly handled documents). From this more detailed breakdown, estimates of the basic cost of processing the documents, based on human and machine processing costs and costs assigned to errors and omissions, were derived. The measure could be used to drive improvements in information extraction systems, to decide whether to use them compared to human extraction (which also has errors), or to improve the discourse model to handle a larger portion.
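
A back-of-the-envelope version of that cost computation might look like the following sketch. The decomposition into three document pools follows the measures just described, but every rate and unit cost is an invented assumption:

```python
def expected_cost_per_document(robustness, accuracy, cost_machine,
                               cost_human, cost_error, errors_per_bad_doc):
    """Expected cost of processing one document.

    robustness: fraction of documents not screened out for humans.
    accuracy:   fraction of machine-processed documents fully correct.
    The three pools (human-handled, machine-correct, machine-incorrect)
    follow the breakdown described in the text.
    """
    screened_out = 1.0 - robustness                 # sent to humans
    machine_ok = robustness * accuracy
    machine_bad = robustness * (1.0 - accuracy)
    return (screened_out * cost_human
            + machine_ok * cost_machine
            + machine_bad * (cost_machine + errors_per_bad_doc * cost_error))

# E.g., 85% of documents handled automatically, 95% of those correct.
print(expected_cost_per_document(0.85, 0.95, cost_machine=0.10,
                                 cost_human=5.00, cost_error=2.00,
                                 errors_per_bad_doc=3.0))
```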

For information extraction projects, it was further suggested that the cost of erroneous inputs might drive a built-in "safety factor" that could be varied for a given application [Reeker 85]. This safety factor was based on linguistic measures of the text (in addition to the discourse model) that could cause problems for the system being studied. The adjustable safety factor could be built into the prescreening mentioned above. In other words, the system would process autonomously to a greater or lesser degree, and could invite human interaction in applications where the cost of errors was especially high. It was suggested that the system would place "warning flags" to help it make a decision on screening out the document, and these could also aid the human involved. Although this was a tentative piece of work, the idea of tying a surface measure (robustness) to the underlying properties of the system is exactly like tying measurable surface properties to underlying theoretical constructs. The theoretical constructs mentioned in this case were structural or semantic ones from linguistics.


From the area of software engineering comes another tradeoff measure that is worth mention. The author did some work on ways of providing metrics - surface metrics, initially - for program readability (or understandability) [Reeker 79]. Briefly, studies of program understanding had identified both go-to statements and large numbers of identifiers (including program labels) as problems. At the same time, the more localized loop statements could result in deep embeddings that were also difficult to understand for software repair or modification. The vague concept of readability could be replaced by a measure of go-to statements and perhaps also one of the number of different identifiers. This particular study suggested depth of embedding as a problem and also suggested a tradeoff between depth of embedding and a metric called identifier load. Identifier load was a function of the number of identifiers and the span of program statements over which they were used. Identifier load tended to increase as depth of embedding was reduced by the obvious methods.
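
As a toy rendering of the two metrics (the tuple representation of a program and the exact normalization are assumptions for illustration; the original formulation in [Reeker 79] may differ):

```python
# Toy rendering of the two readability metrics. A program is represented
# as (line_number, nesting_depth, identifiers_used) tuples.
program = [
    (1, 0, {"total"}), (2, 1, {"i", "n"}), (3, 2, {"total", "i"}),
    (4, 2, {"i"}),     (5, 0, {"total"}),
]

# Depth of embedding: the deepest nesting level reached.
max_depth = max(depth for _, depth, _ in program)

# Identifier load: for each identifier, the span of lines over which it
# is in use; the spans are summed and normalized by program length.
spans = {}
for line, _, idents in program:
    for ident in idents:
        lo, hi = spans.get(ident, (line, line))
        spans[ident] = (min(lo, line), max(hi, line))

identifier_load = sum(hi - lo + 1 for lo, hi in spans.values()) / len(program)
print(max_depth, identifier_load)   # 2 and 1.8 for this toy program
```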

There were a number of similar software metrics studies in the 1970s, and they continue. This approach, however, was part of an attempt to look at natural language for constructs that might be of relevance in programming languages and programming practice [Reeker 80]. The depth measure was based on an idea of Victor Yngve [60], which came out of his work in linguistics - an idea that retains a germ of intuitive truth. Yngve had in turn related his natural language measure of embedding depth to measures of short-term memory from cognitive psychology. Whether these relationships turn out to be true, or lead to related ideas that are true, or not, they illustrate how theoretical constructs can stitch AI, computer science, and other artificial and natural sciences together. They also illustrate the quest for metrics that can firm up the foundations of the sciences.

More Constructs To Be Explored

There are many more existing theoretical constructs that have arisen within AI or been imported from computer science or cognitive science that beg to be better defined, quantified, and related to other constructs, both deep and surface.

Means-ends analysis and case-based reasoning have both been mentioned as forms of problem solving. How do these cognitive characterizations of problem solving relate to one another? At a deeper level is the construct of short-term memory, mentioned in the previous section in relationship to Yngve's depth. How does short-term or working memory relate to long-term memory, and how are the two used in problem solving? The details are not known. The size of a short-term memory may not be as relevant in a machine, where memory is cheap and fast. But we cannot be sure that it is not relevant to various aspects of machine performance, because it is reflected at least in the human artifacts that the machine may encounter. For instance, in resolving anaphora in natural language, the problem may be complicated if possible referents are retrieved from arbitrarily long distances.

A similar problem arises from long-term memory if everything ever learned about a concept is retrieved each time the concept is searched for. This can lower retrieval precision (to use the term discussed earlier for machine retrieval) and cause processing difficulties on a given problem. It may be that Simon's notion of bounded rationality is a virtue in employing intelligence. Are we losing an important parameter in intelligence if we try always to optimize rationality? For AI systems, anytime algorithms and similar constructs for approximate, uncertain, and resource-bounded reasoning have been developed in recent years, and hold a good deal of promise [Zilberstein 96].
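
The defining property of an anytime algorithm - that answer quality grows with computation time and the process can be interrupted at any point with the best answer so far - is easy to sketch (the random-sampling maximization task and the deadline value are illustrative assumptions):

```python
import random
import time

def anytime_maximize(f, candidates, deadline_seconds):
    """Anytime maximization by random sampling: answer quality can only
    improve with time, and the loop can be cut off at any deadline with
    the best answer found so far."""
    best_x, best_v = None, float("-inf")
    stop = time.monotonic() + deadline_seconds
    while time.monotonic() < stop:              # resource-bounded loop
        x = random.choice(candidates)
        v = f(x)
        if v > best_v:
            best_x, best_v = x, v
    return best_x, best_v

# With more time, the returned value approaches the true maximum at x = 7.
print(anytime_maximize(lambda x: -(x - 7) ** 2, range(100), 0.05))
```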

An interesting theoretical construct arising out of AI knowledge representation, and the attempts to use it in expert systems and agents and for other purposes, is that of an ontology. "Ontology" is an old word in philosophy designating an area of study. In AI it has come to designate a type of artifact in an intelligent system: the way that that system characterizes knowledge. In humans, ontologies are shared to a large degree, but certainly differ from every person to every other, despite the fact that we can understand each other. Are some ontologies indicative of more intelligence than others in ways that we can measure? One suggested criterion for high intelligence is the ability to understand and use very fine distinctions (or to actually create new ones, as described in Gödel's memorandum cited by Chandrasekaran and Reeker [74]). Is an ontology's size important, or its organization, or both? Can one quantify a system's ability to add new distinctions?

A related issue is vocabulary. Many people think that an extensive vocabulary, used appropriately, is a sign of intelligence, or at least of scholastic aptitude.


Characterizing Three Related Endeavors Involving Computers and Intelligence and Their Purposes (2)

Mimetic Synthesis
  Principal Goal: Produce behavior that appears to be evidence of cognition or intelligent phenomena.
  Use: Produce the illusion of intelligent behavior for interface purposes, entertainment, etc.
  Category of Endeavor: Computer Technology.
  Approach: Use simulations, stored answers, AI or cognitive models, or other techniques that are convincing to human users.
  Examples: Eliza (J. Weizenbaum), Albert (Garner and Henderson), Talking Coke Machines (?), ...

Cognitive Modeling
  Principal Goal: Produce models of cognitive processes, including learning, planning, reasoning, perception, linguistic behavior, etc.
  Use: Develop psychological theories of cognition. (1)
  Category of Endeavor: Psychological Science (Branch of Natural Science).
  Approach: Use evidence from psychological experiments; make working models; test against human behavior.
  Examples: LT, GPS (Simon and Newell), HAM (J. Anderson), SOAR (A. Newell), ...

Artificial Intelligence
  Principal Goal: Find ways of doing with computers things that we deem intelligent when they are done by humans.
  Use: Tools to augment intelligence and systems that exhibit increasingly intelligent behavior autonomously.
  Category of Endeavor: Computer Science (Branch of Science of the Artificial).
  Approach: Use techniques from mathematical and engineering disciplines, cognitive models, and previous experience. Test through programs.
  Examples: Deep Blue (IBM), SATPLAN (Kautz and Selman), Dendral (Buchanan, Feigenbaum, ...), TD-Gammon (Tesauro), ...

Notes to the table:
(1) By this we mean the traditional cognitive psychology level, not brain function. The latter is a biological approach. In the nature of science, of course, one expects theories of such close areas to be consistent and to inform one another, and to merge in the longer term.
(2) Clearly, each of these endeavors is different, though each can make use of knowledge from the others and some devices could solve all goals. The confusion between them, however, has resulted in misunderstandings for decades.

In computer programs that do human language processing, the vocabulary consists of a lexicon that generally also has structural (syntactic) information for parsing or generating utterances containing the lexical item, and meaning representations for the lexical item. The lexicon can be much larger than any human's vocabulary; but in terms of the vocabulary that can be used appropriately for language production or understanding, it still falls far short of the human vocabulary. For that to be improved, better techniques of semantic mapping are required, including links to ontologies and methods of inferring the ontological connections and the idiosyncratic aspects of the speakers with whom a conversation is taking place. Is the vocabulary an indication of the size of the ontology and the distinctions it makes, or vice-versa? Nobody knows; but better theories of how they link up are needed for both understanding and fully effective use of human language by intelligent systems.

Another cognitive concept that is still a mystery is creativity, certainly a part of intelligence, or at least of high intelligence. Does the ability to add entirely new concepts, not taught, constitute creativity? How does one harness serendipity to develop creativity? Is creativity linked with sensory cognition, the cognitive phenomena related to senses such as vision, including perception, visual reasoning, etc.? There is a need for deep theoretical constructs underlying notions like creativity, and for measures of these constructs and their attributes [Simon 95, Buchanan 00].

Turning to computational constructs, we notice that much of the AI described above takes place through various forms of search. Already there exists a pretty good catalogue of variations on search and how to manage it, in which a good deal of theory is latent. Some of the search is of a state space, involving the ubiquitous state concept basic to theoretical computer science. Search is also coupled with pattern matching, which underlies many of the methods mentioned earlier in this paper.

The potential constructs mentioned here are just a sample of the ones already available in Artificial Intelligence, and to them should be added others found in some of the major works of Newell and Simon on problem solving and cognition [Newell and Simon 65, Newell 87].

Summary and Author's Note

The development of a true science of artificial intelligence is something that has concerned the author for a long time. It has been encouraging to see the development within the field of interesting and non-obvious theoretical constructs. This paper has suggested that theoretical constructs with attributes that we can measure are especially valuable, and it has suggested a number of such candidates. The paper suggests that we enlist Lord Kelvin's emphasis on measurement in choosing such constructs. These same measurable theoretical constructs will in many cases relate (at least at deeper levels) to those of cognitive science, computer science, and other sciences. They will help predict measures at the surface that can be used to provide metrics for the performance (and through that, the intelligence) of intelligent artifacts. We should have in mind the quest for such measurable constructs as we move forward in creating intelligent artifacts.

References

Buchanan, B. G. [00] Creativity at the Meta-Level, Presidential Address, American Association for Artificial Intelligence, August 2000. [Forthcoming in AI Magazine.]

Chandrasekaran, B., J. R. Josephson and V. R. Benjamins [98] Ontology of Tasks and Methods, 1998 Banff Knowledge Acquisition Workshop. [Revised version appears as two papers: "What are ontologies and why do we need them?", IEEE Intelligent Systems, Jan/Feb 1999, 14(1), pp. 20-26; "Ontology of Tasks and Methods", IEEE Intelligent Systems, May/June 1999.]

Chandrasekaran, B. and L. H. Reeker [74] "Artificial Intelligence - A Case for Agnosticism", IEEE Trans. Systems, Man and Cybernetics, January 1974, Vol. SMC-4, pp. 88-94.

Chomsky, N. [65] Aspects of the Theory of Syntax. MIT Press, Cambridge, MA.

Chomsky, N. [75] Reflections on Language. Random House, New York.

Donegan, P. J. & D. Stampe [79] The Study of Natural Phonology. In Dinnsen, Daniel A. (ed.), Current Approaches to Phonological Theory. Indiana University Press, Bloomington, 126-173.

Ernst, G. & Newell, A. [69] GPS: A Case Study in Generality and Problem Solving. Academic Press, New York.

Fitch, F. B. [52] Symbolic Logic. Ronald Press, New York.

Gentner, D., K. Holyoak et al [97] Reasoning and learning by analogy. (A section containing this introduction and other papers by these authors and A. B. Markman, P. Thagard, and J. Kolodner.) American Psychologist, 52(1), 32-66.

Gentzen, G. [34] Investigations into logical deduction. In The Collected Papers of Gerhard Gentzen, M. E. Szabo, ed. North-Holland, Amsterdam, 1969. [Published in German in 1934.]

Kolodner, J. L. [88] Extending Problem Solving Capabilities Through Case-Based Inference. In Proceedings of the DARPA Case-Based Reasoning Workshop, Kolodner, J. L. (ed.), Morgan Kaufmann, Menlo Park, CA.

Leake, D. B. (ed.) [96] Case-Based Reasoning: Experiences, Lessons, and Future Directions. AAAI Press/MIT Press, Cambridge, MA, 1996.

Margenau, H. [50] The Nature of Physical Reality. McGraw-Hill, New York.

Mahadevan, S. and L. P. Kaelbling [96] The National Science Foundation Workshop on Reinforcement Learning. AI Magazine 17(4): 89-93.

Meystel, A. et al [00] Measuring Performance of Systems with Autonomy: Metrics for Intelligence of Constructed Systems. In this volume.

Newell, A. [87] Unified Theories of Cognition. Harvard University Press, Cambridge, Massachusetts, 1990. [Materials from William James Lectures delivered at Harvard in 1987.]

Newell, A. and H. A. Simon [63] GPS, a program that simulates human thought. In Computers and Thought, E. A. Feigenbaum and J. Feldman (eds.), McGraw-Hill, New York.

Newell, A. & Simon, H. A. [65] Programs as theories of higher mental processes. In R. W. Stacy and B. Waxman (eds.), Computers in Biomedical Research (Vol. II, Chap. 6). Academic Press, New York.

Newell, A. & Simon, H. A. [72] Human Problem Solving. Prentice-Hall, Englewood Cliffs, NJ.

Reeker, L. [76] The computational study of language acquisition. Advances in Computers, 15 (M. Yovits, ed.), Academic Press, 181-237, 1976.

Reeker, L. [79] Natural Language Devices for Programming Language Readability: Embedding and Identifier Load. Proceedings, Australian Computer Science Conference, Hobart, Tasmania.

Reeker, L., E. Zamora and P. Blower [83] Specialized Information Extraction: Automatic Chemical Reaction Coding from English Descriptions. Proceedings of the Symposium on Applied Natural Language Processing, Santa Monica, CA, Association for Computational Linguistics, 1983.

Reeker, L. H. [80] Natural Language Programming and Natural Programming Languages. Australian Computer Journal 12(3): 89-93.

Reeker, L. H. [85] Specialized information extraction from natural language texts: The "Safety Factor". Proceedings of the 1985 Conference on Intelligent Systems and Machines, 318-323, Oakland University, Michigan, 1985.

Salton, G. [71] The SMART Retrieval System. Prentice-Hall, Englewood Cliffs, NJ.

Simon, H. A. [69] The Sciences of the Artificial. Third Edition. MIT Press, Cambridge, MA, 1996. [Original version published 1969.]

Simon, H. A. [95] Explaining the Ineffable: AI on the Topics of Intuition, Insight and Inspiration. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, IJCAI 95, Montreal, Morgan Kaufmann, Menlo Park, CA, Volume 1, 939-949.

Stampe, D. [73] A Dissertation on Natural Phonology. Garland Publishing, New York, 1979. [Original University of Chicago dissertation submitted in 1973.]

Thrun, S. and L. Pratt (eds.) [98] Learning To Learn. Kluwer Academic Publishers.

Touretzky, D. S. and L. M. Saksida [97] Operant conditioning in Skinnerbots. Adaptive Behavior 5(3/4): 219-247.

Turing, A. M. [50] "Computing Machinery and Intelligence." Mind LIX (236; Oct. 1950): 433-460. [Reprinted in Collected Works of A. M. Turing, vol. 3: Mechanical Intelligence, D. C. Ince, ed., Elsevier Science Publishers, Amsterdam, 1992: 133-160.]

Weizenbaum, J. [66] ELIZA - a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9, 36-45.

Weizenbaum, J. [74] Automating psychotherapy. Communications of the ACM, 17(7): 425, July 1974.

Yngve, V. [60] The depth hypothesis. Proceedings, Symposia in Applied Mathematics, Vol. 12, American Math. Society, Providence, RI, 1961. [Based on publication under another title, 1960.]

Zilberstein, S. [96] Using Anytime Algorithms in Intelligent Systems. AI Magazine, 17(3): 73-83, 1996.
