On WordNet, Text Mining, and Knowledge Bases of the Future
Peter Clark
Knowledge Systems, Boeing Phantom Works
On Machine Understanding
• Creating models from data…
“China launched a meteorological satellite into orbit Wednesday, the first of five weather guardians to be sent into the skies before 2008.”
• Suggests:
  – there is a rocket launch
  – China owns the satellite
  – the satellite is for monitoring weather
  – the orbit is around the Earth
  – etc.
None of these are explicitly stated in the text
On Machine Understanding
• Understanding = creating a situation-specific model (SSM), coherent with data & background knowledge
  – Data suggests background knowledge which may be appropriate
  – Background knowledge suggests ways of interpreting data
[Diagram: fragmentary, ambiguous inputs → coherent model (situation-specific)]
On Machine Understanding
[Diagram: fragmentary, ambiguous inputs → (assembly of pieces, assessment of coherence, inference) → coherent model (situation-specific)]
• The inputs: only a tiny part of the target model; contain errors and ambiguity; not even a subset of the target model
• World Knowledge: core theories of the world; a ton of common-sense/episodic/experiential knowledge (“the way the world is”)
On Machine Understanding
• Conjectures about the nature of the beast:
  – “Small” number of core theories
    • space, time, movement, …
    • can encode directly
  – Large amount of “mundane” facts
    • a dictionary contains many of these facts
    • also: episodic/script-like knowledge needed
On Machine Understanding
• How to acquire this background knowledge?
  – Manual encoding (e.g., Cyc, WordNet)
  – NLP on a dictionary (e.g., MindNet)
  – Community-wide acquisition (e.g., OpenMind)
  – Knowledge mining from text (e.g., Schubert)
  – Knowledge acquisition technology:
    • graphical (e.g., Shaken)
    • entry using “controlled” (simple) English
What We’re Trying To Do…
[Diagram: an English-based description of a scene (partial, ambiguous) + knowledge base → coherent representation of the scene (elaborated, disambiguated) → Question-Answering, Search, etc.]
Example captions: “A man pulls and closes an airplane door”, “A lever is rotated to the unarmed position”, “…”, “…”

Illustration: Caption-Based Video Retrieval
[Diagram: video → captions (manual authoring) → caption text interpretation → elaboration (inference, scene-building) using world knowledge, yielding e.g. Pull(agent: Man, object: Door), Door is-part-of Airplane; a query such as Touch(Person, Door) is then answered by search]
Some Example Inferences
“Someone broke the side mirrors of a truck” → the truck is damaged
…if only the system knew that…
  IF X breaks Y THEN (result) Y is damaged
  IF X is-part-of Y AND X is damaged THEN Y is damaged
  A mirror is part of a truck
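The two rules above can be run by naive forward chaining over subject–relation–object triples. The sketch below is illustrative only (not Clark's actual system); the triple vocabulary and fact names are made up for this example.

```python
# Toy forward chaining over the slide's two rules, applied to triples.
facts = {
    ("man", "breaks", "mirror"),
    ("mirror", "is-part-of", "truck"),
}

def forward_chain(facts):
    """Repeatedly apply the rules until no new facts are derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        for (x, rel, y) in facts:
            # IF X breaks Y THEN (result) Y is damaged
            if rel == "breaks":
                new.add((y, "is", "damaged"))
            # IF X is-part-of Y AND X is damaged THEN Y is damaged
            if rel == "is-part-of" and (x, "is", "damaged") in (facts | new):
                new.add((y, "is", "damaged"))
        if not new <= facts:
            facts |= new
            changed = True
    return facts

derived = forward_chain(facts)
print(("truck", "is", "damaged") in derived)  # True: the slide's inference
```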
Some Example Inferences
“A man cut his hand on a piece of metal” → the man is hurt
…if only the system knew that…
  IF organism is cut THEN (result) organism is hurt
  IF X is-part-of Y AND X is cut THEN Y is cut
  A hand is part of a person
  (also: Metal can cut)
Some Example Inferences
“A man carries a box across the floor” → a person is walking
…if only the system knew that…
  IF X is carrying something THEN X is walking
  IF X is a man THEN X is a person
Some Example Inferences
“The car engine” → the engine part of a car
“The car fuel” → the fuel which is consumed by the car
“The car driver” → the person who is driving the car
…if only the system knew that…
  Cars have engines
  Cars consume fuel
  People drive cars
  A driver is a person who is driving
Some Example Inferences
“The man writes with a pen” → Pen = pen_n1 (writing implement)
“The pig is in the pen” → Pen = pen_n2 (container, typically for confining animals)
…if only the system knew that…
  people write
  writing is done with writing implements
  a pen (n1) is a writing implement
  a pig is an animal
  animals are sometimes confined
  a pen (n2) confines animals
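One simple way to operationalize this kind of knowledge-driven disambiguation is Lesk-style context overlap: pick the sense whose associated knowledge words best match the sentence. The mini sense inventory below is hand-written for illustration; it is not WordNet data or Clark's actual method.

```python
# Toy Lesk-style disambiguation: each sense carries a small bag of
# knowledge words; the sense with maximal overlap with the sentence wins.
SENSES = {
    "pen_n1": {"write", "writes", "writing", "implement", "ink", "man"},
    "pen_n2": {"pig", "animal", "confine", "enclosure", "farm"},
}

def disambiguate(senses, sentence):
    """Return the sense whose knowledge words overlap the sentence most."""
    context = {w.lower().strip(".,") for w in sentence.split()}
    return max(senses, key=lambda s: len(senses[s] & context))

print(disambiguate(SENSES, "The man writes with a pen"))  # pen_n1
print(disambiguate(SENSES, "The pig is in the pen"))      # pen_n2
```

With richer world knowledge (as on the slide), the overlap sets would be derived from facts like “a pen (n2) confines animals” rather than listed by hand.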
Some Example Inferences
“The blue car” → the color of the car is blue
…if only the system knew that…
  physical objects can have colors
  blue is a color
  a car is a physical object
WordNet…
• psycholinguistically motivated lexical reference system
• core:
  – synsets (“concepts”)
  – hypernym links
• later added additional relationships:
  – part-of
  – substance-of (“pavement is substance of road”)
  – causes (“revive causes come-to”)
  – entails (“sneeze entails exhale”)
  – antonyms (“appearance”/“disappearance”)
  – possible-values (“disposition” = {“willing”, “unwilling”})
• Currently at version 2.0
  – What will version 10.0 (say) look like?
  – What should it look like?
  – Is it / could it migrate towards more of a knowledge base?
Why Use WordNet
• It’s a comprehensive ontology (approx. 120,000 concepts)
• links between concepts (synsets) and lexical items (words)
• Simple structure, easy to use
• Rich population of hypernym links
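The “simple structure” claim can be made concrete: WordNet's core is just a word→synset mapping plus hypernym links between synsets. The sketch below models that structure with a tiny hand-written graph (not real WordNet data; for real access, NLTK ships a WordNet reader).

```python
# Minimal model of WordNet's core: synsets with lemmas and hypernym links.
# The three synsets below are illustrative stand-ins, not WordNet entries.
synsets = {
    "car.n.01": {"lemmas": ["car", "auto", "automobile"],
                 "hypernym": "motor_vehicle.n.01"},
    "motor_vehicle.n.01": {"lemmas": ["motor vehicle"],
                           "hypernym": "vehicle.n.01"},
    "vehicle.n.01": {"lemmas": ["vehicle"], "hypernym": None},
}

def hypernym_chain(synset_id):
    """Walk hypernym links from a synset up to the root."""
    chain = []
    while synset_id is not None:
        chain.append(synset_id)
        synset_id = synsets[synset_id]["hypernym"]
    return chain

print(hypernym_chain("car.n.01"))
# ['car.n.01', 'motor_vehicle.n.01', 'vehicle.n.01']
```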
Problems with WordNet
• Too fine-grained word senses
  – e.g., “cut” has 41 senses, including:
    • cut (separate)
    • cut grain
    • cut timber
    • cut my hair
    • cut (knife cuts)
  – linguistically, not representationally, motivated
    • e.g., “cut grain” is a sense only because a word (“harvest”) happens to exist for it
  – representationally, many senses share a core meaning, but the commonality is not captured
• Missing concepts/senses that lack an English word (or were just forgotten)
  – e.g., goal-oriented entity (person, corporation, country)
  – difference between physical and legal ownership
• Single inheritance (mainly)
  – very different to Cyc, which uses multiple inheritance a lot
Problems with WordNet
• “isa” (hypernym) hierarchy is broken in many places
  – sometimes means “part of”
    • e.g., Connecticut -> America
  – mixes instance-of and subclass-of
    • e.g., Paris -> Capital_City -> City
  – many links seem strange/questionable
    • e.g., “launch” is a type of “propel”?
    • again, psychologically not representationally motivated
  – has major implications for reasoning
• Semantics of relationships can be fuzzy/asymmetric
  – “motor vehicle” has-part “engine” means…?
  – “car” has-part “running-board”
Problems with WordNet
• Many relationships missing
  – Simple:
    • verbs/nominalizations (e.g., “plan” (v) vs. “plan” (n))
    • adverbs/adjectives (“rapidly” vs. “rapid”)
  – Conceptual; many, in particular:
    • causes, material, instrument, content, beneficiary, recipient, result, destination, shape, location
What we’d like instead….
• Want a knowledge resource that can provide rich expectations about the world
  – to help interpret ambiguous input
  – to infer additional facts beyond those in the input
  – to create a coherent model from fragmented input
• It would have:
  – a small set of “core” theories about the world
    • containers, transportation, movement, space
    • probably hand-built
  – many “mundane” facts which instantiate those theories in various ways
What we’d like instead…. Core theories, e.g., Transportation:
  OBJECTS can be at PLACES
  VEHICLES can TRANSPORT OBJECTS from PLACE to PLACE
  TRANSPORT requires the OBJECT to be IN the VEHICLE
  BEFORE an OBJECT is TRANSPORTED by a VEHICLE from PLACE1 to PLACE2, the OBJECT and VEHICLE are at PLACE1
  etc.
Basic facts/instantiations:
  Cars are vehicles
  Cars can transport people
  Cars travel along roads
  Ships can transport people or goods
  Ships can transport over water between ports
  Rockets can transport satellites into space
  ….
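The Transportation core theory above can be read as preconditions and effects, which makes it directly executable. The sketch below encodes it that way under one instantiation (“cars can transport people”); the object names (`car1`, `alice`) and the `World` class are illustrative, not part of the original proposal.

```python
# Executable reading of the Transportation core theory:
# preconditions (object IN vehicle, both at source) and effects (both at dest).
class World:
    def __init__(self):
        self.at = {}      # object -> place
        self.inside = {}  # object -> containing vehicle

    def transport(self, vehicle, obj, src, dst):
        # BEFORE transport from src to dst, object and vehicle are at src
        assert self.at[vehicle] == src and self.at[obj] == src
        # TRANSPORT requires the OBJECT to be IN the VEHICLE
        assert self.inside.get(obj) == vehicle
        # Effect: vehicle and object end up at the destination
        self.at[vehicle] = dst
        self.at[obj] = dst

w = World()
w.at = {"car1": "home", "alice": "home"}
w.inside["alice"] = "car1"               # Alice gets in the car
w.transport("car1", "alice", "home", "work")
print(w.at["alice"])                     # work
```

The “mundane” instantiations (cars transport people, rockets transport satellites) would then just restrict which vehicle/object pairs `transport` accepts.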
Some Questions
• To what extent can a rulebase be simplified into a set of database-like tables?
• How much of the table-like knowledge can we learn automatically?
• How can we reason with “messy knowledge” that such a database inevitably contains?
• How can we represent different views/perspectives?
• Why not use Cyc?
• How can we address WordNet’s deficiencies efficiently?
The “Common Sense” KB
• An attempt to rapidly accumulate some core knowledge and “routine” facts, to support:
  – specific applications
  – research in how to work with all this knowledge
• Features:
  – knowledge (mainly) entered in simple English
  – interactively interpreted into KM (logic) structures
  – using WordNet’s ontology + UT’s “slot” library

Why Simple Language-based Entry?
• Seems to be easier and faster than formal encoding
  – but more restricted
• More comprehensible & accessible
• Viable (if a dictionary is a good model of scope…)
• Ontologically less committal (can reinterpret)
• Forces us to face some key issues
  – ambiguity, conflict, “messy” knowledge
• Step towards more extensive language processing
• Costs: more infrastructure needed, limited expressivity, still need to understand some KR
Demo
Or…
• Can (at least some of) this basic world knowledge be acquired automatically? e.g.,
  – Girju
  – Etzioni
  – Schubert
Knowledge Mining
Schubert’s Conjecture:
  There is a largely untapped source of general knowledge in texts, lying at a level beneath the explicit assertional content, and which can be harnessed.
“The camouflaged helicopter landed near the embassy.”
  → helicopters can land
  → helicopters can be camouflaged
Our attempt: “lightweight” LFs generated from Reuters
LF forms:
  (S subject verb object (prep noun) (prep noun) …)
  (NN noun … noun)
  (AN adj noun)
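The abstraction step, from a parsed sentence to generic “can” propositions, can be sketched as below. The input tuple is hand-supplied here (a real system would obtain it from a parser), and the `abstract` function, its tuple format, and the naive “add an s” pluralization are all simplifications for illustration.

```python
# Schubert-style abstraction: from one parsed sentence, emit generic
# propositions about what the subject's kind can do or be.
def abstract(lf):
    """lf = (subject_noun, adjectives, verb, [(prep, object_noun), ...])."""
    noun, adjs, verb, pps = lf
    props = [f"{noun}s can {verb}"]                       # helicopters can land
    props += [f"{noun}s can be {adj}" for adj in adjs]    # ... can be camouflaged
    props += [f"{noun}s can {verb} {prep} a(n) {obj}" for prep, obj in pps]
    return props

# "The camouflaged helicopter landed near the embassy."
lf = ("helicopter", ["camouflaged"], "land", [("near", "embassy")])
for p in abstract(lf):
    print(p)
# helicopters can land
# helicopters can be camouflaged
# helicopters can land near a(n) embassy
```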
Knowledge Mining
Newswire article:
  HUTCHINSON SEES HIGHER PAYOUT. HONG KONG. Mar 2. Li said Hong Kong’s property market remains strong while its economy is performing better than forecast. Hong Kong Electric reorganized and will spin off its non-electricity related activities. Hongkong Electric shareholders will receive one share in the new subsidiary for every owned share in the sold company. Li said the decision to spin off …
Implicit, tacit knowledge:
  Shareholders may receive shares.
  Companies may be sold.
  Shares may be owned.
Knowledge Mining – our attempt
(S "history" "fall" ("out of" "view"))(S "rate" "fall on" (NIL "tuesday") ("to" "percent"))(S "index" "rally" "point" ("with" "volume") ("at" "share"))(S "you" "have" "decline" ("in" "inflation") ("in" "rate"))(S "you" "have" "decline" ("in" "figure") ("in" "rate"))(S "you" "have" "decline" ("in" "lack") ("in" "rate"))(S "Boni" "be wary")(S "recovery" "be" "led")(S "evidence" "patchy")(S "expansion" "be worth")(S "we" "be content")(S "investment" "boost" "sale" ("in" "zone"))(S "Eaton" "say" (S "investment" "boost" "sale" ("in" "zone")))(S "it" "grab" "portion" ("away from" "rival"))
Fragment of the raw data (Reuters)
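One obvious first step with raw LF data like this is aggregation: count how often each verb frame occurs, so that recurring patterns (e.g. the repeated “have decline in X in rate”) stand out from noise. The sketch below is illustrative; the tuple encoding mirrors the LF forms above, and `frame` is a made-up helper that keeps only the verb and its preposition skeleton.

```python
# Count verb + preposition-skeleton frames over a few of the LFs above.
from collections import Counter

lfs = [
    ("S", "history", "fall", ("out of", "view")),
    ("S", "you", "have", "decline", ("in", "inflation"), ("in", "rate")),
    ("S", "you", "have", "decline", ("in", "figure"), ("in", "rate")),
    ("S", "you", "have", "decline", ("in", "lack"), ("in", "rate")),
]

def frame(lf):
    """Reduce an LF to (verb, tuple of prepositions), dropping arguments."""
    verb = lf[2]
    preps = tuple(arg[0] for arg in lf[3:] if isinstance(arg, tuple))
    return (verb, preps)

counts = Counter(frame(lf) for lf in lfs)
print(counts.most_common(1))  # [(('have', ('in', 'in')), 3)]
```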
Knowledge Mining…what next?
• What could we do with all this data?
  – Use it to bias the parser
  – Extra source of knowledge for the KB
    • source of input sentences for our system?
    • possibilities (“this deduction looks coherent”)
• But:
  – Ambiguity makes it hard to use
    • word senses, relationships
  – No notion of “relevance”
  – Many types of knowledge not mined
    • e.g., rule-like, script-like
Summary
• Machine understanding = building a coherent model
• Requires lots of world knowledge
  – core theories + lots of “mundane” facts
• WordNet
  – a potentially useful resource, but with many problems
  – slowly and manually becoming more KB-like
• There’s a lot of potential to jump ahead with text mining methods
  – e.g., Schubert’s approach
  – KnowItAll
• We would like to use the results for reasoning!!