Post on 14-Dec-2015
Agenda
• Introduction
• Recap – The Story So Far
• The “Knowledge Gap”
  – Overview
  – Characterization and analysis
  – Quantification
• Two Case Studies
  – AP chemistry
  – Grade-school biology
• Dimensions of Difficulty
• Principles for an Extensible KB
• Knowledge Mining
• Summary: Findings, Products, and Recommendations
SRI-Boeing’s Reading to Learn Seedling
• Goal:
  – Study issues in learning through reading by working with a reduced version of the problem, namely controlled, rather than unrestricted, natural language. The NLP task is factored into two:
    • full NL → CL
    • CL → logic (this project)
• Rationale:
  – By sidestepping some of the shallow linguistic issues of full NLP, we can focus on deeper issues
  – Methods for full NL → CL can be studied separately
SRI-Boeing’s Reading to Learn Seedling
• Approach:
– Rewrite 5 pages of chemistry text into our controlled language, CPL
– Extend and use our CPL interpreter to generate logic
– Integrate this new knowledge with an existing chemistry knowledge base (from the Halo Pilot), which has the new knowledge surgically deleted from it
– Report on the problems encountered and solutions developed
This Seedling in Mobius
[Diagram: the Mobius components (Knowledge Integration, Introspection, Natural Language Processing, Test Generation), with this seedling's place among them marked]
Recap: October 2005
• Tutorial on the 5 pages of chemistry text
  – Acid-base reactions, proton transfer
• Where is that knowledge in the text?
  – Wanted: Clear, declarative statements
  – Got: obscure/missing/complex/indirect
• Where is that knowledge in the Halo KB?
  – Wanted: Modular, constructed from general pieces
  – Got:
    • buried in procedures and code
    • Very hard to ablate or extend
  – Suggestions for a better KB structure
(every Compute-Conjugate-Acid has
  (input ((a Chemical with (plays ((a Base-Role))))))
  (parent_formula ((the term of (the nested-atomic-chemical-formula of
    (the has-basic-structural-unit of (the input of Self))))))
  (target-unit ((if (the parent_formula of Self)
    then (:set (#'(LAMBDA () (GET-CONJUGATE-ACID-ATOMIC-FORMULA-BACK
      (KM0 '(|the| |parent_formula| |of| |Self|)))))))))
  (output
    ((if (oneof (the input of Self) where (It isa H2O-Substance))
      then (a H3O-Plus-Substance)
      else ((forall (allof2 (the target-unit of Self)
                     where ((not (It2 = (the parent_formula of Self)))))
              (the output of (a Identify-Chemical with
                (input ((a Chemical with
                  (has-basic-structural-unit ((the output of
                    (a Identify-Chemical-Entity with
                      (input ((a Chemical-Entity with
                        (nested-atomic-chemical-formula ((a Chemical-Formula with (term (It)))))))))))))))))))))))
?
“An acid = a base + a proton”
(every Acid-Role has
  (intensity (
    (a Intensity-Value with
      (value (
        (:pair ;; Case statement for Acids.
          (if ((the played-by of Self) isa Ionic-Compound-Substance)
           then (if (((the played-by of Self) isa HCl-Substance) or
                     ((the played-by of Self) isa HBr-Substance) or
                     ((the played-by of Self) isa HI-Substance) or
                     ((the played-by of Self) isa HClO3-Substance) or
                     ((the played-by of Self) isa HClO4-Substance) or
                     ((the played-by of Self) isa H2SO4-Substance) or
                     ((the played-by of Self) isa HNO3-Substance))
                 then *strong
                 else (if (((the played-by of Self) isa H3PO4-Substance) or
                           ((the played-by of Self) isa HF-Substance) or
                           ((the played-by of Self) isa HC2H3O2-Substance) or
                           ((the played-by of Self) isa H2CO3-Substance) or
Relative strengths of different acids
Recap: March 2006
• Two CPL versions: (i) close to text, (ii) close to inference
  – Predictable performance
• Discussion of “bridging the gap”

Manually bridging the “gap”
IF there is an equation of a reaction
AND a first chemical entity has a chemical formula
AND a second chemical entity has a second chemical formula
AND the first chemical formula is part of the left side of the equation
…
THEN the direction of the reaction is right
AND the equilibrium side of the reaction is right.
Inference-Supporting CPL: Predictable Performance

Task | Halo KB | CPL
Conjugate pairs | Giant KM procedure for formula manipulation | Lookup table
Relative strengths | Qualitative absolute strengths (strong/weak/negligible) + qualitative comparison | Relative strength assertions
Labelling acid/bases in a reaction | Giant KM procedure for reaction manipulation | if-then rule using conjugate pairs
Computing direction of the reaction | KM rule | if-then rule

Slide annotations comparing the two columns: “More general”, “≈ (equivalent)”
Questions and Tasks from Last Time
• Analysis of “the gap”
  – What is the nature of the gap?
  – Can we characterize it?
  – Can we quantify it?
• AP chemistry vs. grade-school biology
  – How does the gap look in different texts? Domains?
  – What are the fundamental problems?
  – How severe are they?
  – How might they be overcome?
• Case Studies
  – Given
    • text/naïve CPL formulation A
    • Inference-capable target B
  – What knowledge is needed to get from A to B?
  – How much can be pump-primed, how much bootstrapped?
I: Understanding Language
Natural and Controlled Languages
• Where is Reading to Learn/Mobius’s Achilles’ heel?
  – Schubert: “Dealing with real natural language”
  – Not (just) the grammatical complexity
  – It is the imprecision, messiness, incompleteness, and erroneous nature of real language
• Two styles of CPL usage:
  (i) As a declarative rule language
  (ii) As grammatically simpler real language
• Worked with both within this Seedling
  (i) does inference, but is far from original text
  (ii) is close to the text, but barely supports inference
(i) CPL as a declarative rule language
“IF a first chemical is stronger than a second chemical AND the second chemical is stronger than a third chemical THEN the first chemical is stronger than the third chemical.”
“IF there is an equation of a reaction
AND a first chemical entity has a chemical formula
AND a second chemical entity has a second chemical formula
AND the first chemical formula is part of the left side of the equation
AND the second chemical formula is part of the right side of the equation
AND the first chemical entity is playing a base role
AND the second chemical entity is playing a base role
AND the first chemical entity is stronger than the second chemical entity
THEN the direction of the reaction is right
AND the equilibrium side of the reaction is right.”
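The two rules above can be read as ordinary forward-chaining rules. The following is a minimal sketch of that reading, not the CPL interpreter or KM: the encoding of "stronger than" facts as pairs, the function names, and the sample strength assertions are all assumptions made for illustration.

```python
from itertools import product

def transitive_closure(stronger):
    """First rule: keep adding (a, c) whenever (a, b) and (b, c) hold."""
    facts = set(stronger)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(facts), repeat=2):
            if b == c and a != d and (a, d) not in facts:
                facts.add((a, d))
                changed = True
    return facts

def reaction_direction(left_base, right_base, stronger):
    """Second rule: if the base on the left side of the equation is
    stronger than the base on the right side, conclude 'right'."""
    if (left_base, right_base) in transitive_closure(stronger):
        return {"direction": "right", "equilibrium_side": "right"}
    return None  # the rule does not fire, so nothing is concluded

# Hypothetical strength assertions, for illustration only.
STRONGER = {("OH-", "NH3"), ("NH3", "H2O")}
```

With these assertions, `reaction_direction("OH-", "H2O", STRONGER)` fires only because the transitivity rule first derives that OH- is stronger than H2O.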
(ii) CPL as grammatically simpler real language
Acids have a sour taste.
Acids cause some dyes to change color.
Bases have a bitter taste.
Bases have a slippery feel.
All acids contain hydrogen.
37 percent of the mass of concentrated hydrochloric acid is HCl.
The concentration of HCl in concentrated hydrochloric acid is 12 M.
HCl reacts with NH3 without an aqueous solution.
The reaction transfers a proton from an HCl molecule to an NH3 molecule.
The "HX" in Equation 16.6 donates a proton.
The donating leaves behind an X-minus ion.
The X-minus ion plays a Bronsted-Lowry base in the reverse reaction.
The H2O molecule in Equation 16.6 accepts a proton.
The accepting produces an H3O-plus ion.
Two Paths from Language to Logic…
[Diagram: one path runs from Real Text through Real(istic) CPL Text to a Literal/messy logic representation; the other runs from Declarative CPL rules to an Inference-supporting Representation. “The Knowledge Gap” separates the two representations.]

“Israel’s Problem”
[Same diagram as on the previous slide]
Assume a perfect algorithm for English to (literal-like) logic. Are you done?
An Analysis of the Gap
• What is the nature of the gap?
• Can we characterize it?
• Can we quantify it?
• How does the gap look in different texts? Domains?
• What are the fundamental problems?
• How severe are they?
• How might they be overcome?
Analysis
• Looked at these phenomena in two sets of text
  – 5 target pages of AP chemistry
  – 5 pages of grade-school level biology
    • from the Web, about the heart and its function
• Categorization of main causes
• Loose quantification of their frequency
9 Fundamental Causes of the Gap
1. Many idiomatic words/phrases, each requiring a theory
2. Some knowledge is taught by example
3. Much important knowledge is conveyed by diagrams and tables
4. Generic sentences are ubiquitous
5. Some text teaches problem-solving knowledge
6. Discourse context is important (need sentence context)
7. Many sentences pose major representational challenges
8. Math/Algebraic models are extremely challenging
9. Text is full of ambiguity, metaphor and metonymy/loosespeak
1. Idiomatic/special-purpose words/phrases
• Many words/phrases require special interpretation
  – Breadth requirement is very challenging!
    • 70% in chem, 40% in bio
  – Chemistry
    • “The reaction favors transfer of…”
    • “From the earliest days of experimental chemistry…”
    • “The ion, however, more closely represents reality”
    • “When we closely examine the reaction…”
    • “According to their definition…”
  – Biology
    • “This is important for the cells to do their work.”
    • “On its way back to the heart…”
    • “The right-side pumps stale blood…”
    • “to smaller and smaller branched tubes…”
2. Examples
• Examples play a key role in human teaching
• How important are these for a machine?
  – Consolidation, verification, disambiguation?
• 35% chem, <5% bio
3. Diagrams and Tables
• 10% in chemistry, but key ones! Incidental in bio.
• Show-stopper for some needed knowledge
[Figures: a diagram “teaching” how to compute conjugate acid/base pairs; a table of relative strengths of acids]
4. Generics
• Reference to a collection rather than an individual object
• Ubiquitous! 90% chemistry, 95% biology
  – Chemistry
    • “Acids cause certain dyes to change color”
    • “Acids have a sour taste”
    • “A substance that is …. is called amphoteric”
  – Biology
    • “The blood leaving the aorta is full of oxygen”
    • “Veins have thin walls”
    • “The heart pumps blood to your lungs”
Why are generics hard?
• Quantification
  • “Acids contain hydrogen.”
• Fuzzy quantifiers
  • “An H3O+ ion sometimes reacts with three H2O molecules”
• Presuppositions
  • “HCl dissolves in water.”
  • “Acids cause some dyes to change color”
  • “Acid irritates the skin”
• Need background knowledge!
  • “IF an acid touches some skin THEN that skin is irritated”
  or more generally
  • “IF acid + skin are related in a way where irritation may plausibly occur… THEN it will occur.”
5. Needing/Teaching Problem-Solving Knowledge
• Problem-solving knowledge
  – Chemistry (20%), biology (<5%)
• Worse, it is often not even explicit in the text
6. Discourse Context
• Can we take sentences in isolation? (“bag of lines”)
• Obstacles:
  – Pronoun resolution (30% chem, 50% bio)
  – Context: unqualified compound nouns (most)
    • “Every [Bronsted-Lowry] acid has a conjugate [BL] base”
    • “The [human] heart…The [human] arteries…”
  – Other dependencies (15% chem, <5% bio)
    • “Therefore, HX is the Bronsted-Lowry acid”
    • “The other conjugate acids are HS-, PH3 and CO3^2-”
7. Major Representational Challenges
• Hard to quantify: ~70% chem, ~40% bio
• Potentiality:
  – “an acid is a substance (molecule or ion) that can donate a proton to another substance. Likewise, a base is a substance that can accept a proton.”
• Conveying a proof
• Imprecision and comparatives:
  – “About 37% by mass”
  – “Interacts strongly”
  – “The aorta is the largest artery in the body”
8. Math/Algebraic models
• ~65% of chemistry sentences use or manipulate formulae
  • “NaOH dissociates into Na+ and OH- ions.”
  • “An H+ ion is simply a proton with no electrons”
  • “HX and X- differ only in the presence of a proton”
• Challenges
  – Relating the symbol system to the real world
  – Defining and applying operations on the symbol system
  – Relating those operations to the real world
Math/Algebraic models (cont)
• Minimal in grade-school biology
  – nearest is rates and measures
    • “the heart contracts 70 times a minute”
    • “The plasma is 95% water and the other 5% of dissolved substances”
    • “In an adult’s body there is 10.6 pints of blood”
9. Loosespeak (metaphor, metonymy, etc)
– Where a “literal interpretation” is incorrect
  • Ignoring overgenerality
– In these texts, 30% chem, 10% bio
  • Probably both higher in general
– Chemistry
  • The molecule, substance, symbol distinction – Huge!
  • Accounts for ~50% of the complexity of Halo KB.
  • In other texts (not this one) metaphor also used often
[Diagram: the equation text “HC2H3O2(aq) + … C2H3O2-”, annotated with “formula” and “basic-unit” links]
Loosespeak (metaphor, metonymy, etc)
• Biology: metaphor more common
  • “Your heart’s job is to pump blood”
  • “Blood delivers oxygen…On the return trip, the blood picks up waste products”
Loosespeak (metaphor, metonymy, etc)
• Analysis by Univ Texas at Austin (chemistry)– Loosespeak is everywhere!
Relative Frequency of Phenomena
[Bar chart: relative frequency (0 to 100%) of idioms, generics, representation, algebra, loosespeak, discourse, problem-solving, examples, and diagrams in AP Chemistry (5 pages) vs. Grade-School Biology (5 pages)]
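The percentages quoted on the preceding slides can be collected into one small table for comparison. This is only a sketch: the dictionary layout and function name are assumptions, "<5%" is stored as its upper bound 5, the discourse row uses the pronoun-resolution figures, and values the slides give only qualitatively ("incidental", "minimal") are left as None.

```python
# (chemistry %, biology %) as quoted on the slides above.
FREQUENCY = {
    "idioms": (70, 40),
    "examples": (35, 5),             # biology quoted as "<5%"
    "diagrams": (10, None),          # "incidental" in biology
    "generics": (90, 95),
    "problem-solving": (20, 5),      # biology quoted as "<5%"
    "discourse (pronouns)": (30, 50),
    "representation": (70, 40),      # both quoted as approximate
    "math": (65, None),              # "minimal" in biology
    "loosespeak": (30, 10),
}

CHEMISTRY, BIOLOGY = 0, 1

def ranked(column):
    """Return (phenomenon, percent) pairs for one domain, most frequent first."""
    rows = [(name, values[column])
            for name, values in FREQUENCY.items()
            if values[column] is not None]
    return sorted(rows, key=lambda row: row[1], reverse=True)

for name, percent in ranked(CHEMISTRY):
    print(f"{name:22s} {percent:3d}%")
```

Generics lead in both domains, while idioms, representation, and math account for much of the extra difficulty of the chemistry text.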
Original English
[Textbook passage shown on the slide]

CPL (like)
Some acids are better proton donors than other acids.
Some bases are better proton acceptors than other bases.
The conjugate base of a strong proton donor is a weak proton acceptor.
The conjugate acid of a strong proton acceptor is a weak proton donor.
A stronger acid has a weaker conjugate base.
A stronger base has a weaker conjugate acid.
A stronger acid is a better proton donor.
A stronger base is a better proton acceptor.

How do we bridge the gap?
From Original English to CPL – 1
• Resolve “others” to mean “other acids/bases”
• Use “likewise” to guide a parallel construction
• Need to represent “some,” “other,” “better”
• Assumes a scale of ability to donate/accept

From Original English to CPL – 2a
• Need to interpret “If we do X, we find that Y” as a mental exercise that draws a conclusion
• Need to have a concept of an ordering based on some ability (to donate a proton)
• Resolve “their ability” back to types of acids
• Resolve quantification – one proton per instance of acid molecule

From Original English to CPL – 2b
• Here “a substance” means an acid molecule
• Need to handle jumps between substance-level and molecule-level references
• Need to interpret “the more readily an A does B, the less readily a C does D”
• Need a model of two qualitative scales of ability, with an inverse relationship
• Resolve “its conjugate base” back to the acid

From Original English to CPL – 3
• “Similarly” is a cue for a parallel construction
• Other issues are the same as in the previous sentence (inverse qualitative scales)

From Original English to CPL – 4a
• “In other words” is a cue for another view of the same knowledge in the previous sentence
• “the more readily an acid gives up a proton” = “the stronger an acid”
• Related qualitative scales again
• “the stronger an acid” is special syntax

From Original English to CPL – 4b
• Semicolon here denotes parallel constructions
• This is also another view of the same knowledge in the previous sentence
• “the more readily a base accepts a proton” = “the stronger a base”
• Inverse qualitative scales again
Overall Interpretation (sketch)
[Diagram: four qualitative scales. “Acid readily gives up a proton” and “Acid strength” are parallel (“In other words”), as are “Conjugate base readily accepts a proton” and “Conjugate base strength”. The acid scales and the conjugate-base scales are inverse to each other.]
“Similarly”: replace acid with base, replace conjugate base with conjugate acid
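The interpretation sketched above amounts to two kinds of links between qualitative scales: parallel (same position on both) and inverse (mirrored position). A minimal sketch follows, assuming a three-point scale taken from the strong/weak/negligible vocabulary used earlier in these slides; the class and function names are hypothetical.

```python
from enum import IntEnum

class Level(IntEnum):
    """An assumed three-point qualitative scale."""
    NEGLIGIBLE = 0
    WEAK = 1
    STRONG = 2

def parallel(level):
    """'In other words': readiness to donate and acid strength coincide."""
    return level

def inverse(level):
    """Mirror a position onto the opposite end of the other scale."""
    return Level(max(Level) - level)

def conjugate_base_strength(acid_strength):
    """A stronger acid has a weaker conjugate base."""
    return inverse(acid_strength)

def conjugate_acid_strength(base_strength):
    """'Similarly': swap acid with base and reuse the same inverse link."""
    return inverse(base_strength)
```

Because the acid and base versions share the same `inverse` link, the "Similarly" sentence adds no new mechanism, only a substitution of roles.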
From Original English to Inference-Supporting Logic: Knowledge Requirements
• Discourse Knowledge:
  – Pragmatic knowledge for pronoun resolution
  – Ability to recognize and match parallel constructions
    • E.g., with cue words
    • Both within and across sentences
  – Ability to recognize a mental exercise (“if we do …”)
• Domain Knowledge:
  – Models of qualitative scales and relationships between two scales
  – Knowledge to handle substance/molecule metonymy
  – Models of abilities & give/receive
  – World knowledge to help resolve quantification
    • e.g., one proton per molecule makes most sense
Grade-school Biology
• Searched the Web, found 4 simple texts about the human heart and its function
• They are much simpler than our college chemistry text, but still exhibit lots of interpretation issues
• Only a few sentences from each text happened to be in pure CPL syntax
• By the time science is taught in school, the students are beyond the Dick & Jane reading level
Grade-school Biology Syntax - 1
• Pronouns are everywhere
  – “Your heart is divided into two sides.” [anyone’s heart]
• Dependent clauses are common
  – “As blood begins to circulate, it leaves the heart …”
  – “… fresh oxygen that we have inhaled …”
• Conjunctions appear between various expressions
  – “… the vessels and the muscles that help and control …”
  – “Lizards don’t have hair or feathers … and can’t sweat …”
• Comparatives are common
  – “The tubes that more gently drain back to the heart …”
• Approximations are common
  – “… some 70 or so times a minute at rest …”
Grade-school Biology Syntax - 2
• Negatives are sometimes used
  – “They do not work on their own, but together as a team.”
• Phrases often modify other terms
  – “The blood leaving the aorta is full of oxygen.”
  – “On its way back to the heart, the blood travels …”
• Infinitives are sometimes used
  – “This is important for the cells … to do their work.”
• Parenthetical expressions are sometimes inserted
  – “… the carbon dioxide (a waste product) is removed ...”
  – “… times a minute – more if you are exercising – and …”
Grade-school Biology Syntax - 3
• Rhetorical questions to the reader
  – “Did you know that your heart is the strongest muscle?”
• Modals are sometimes used
  – “… so that your body can get rid of them.”
  – “… your blood vessels could circle the globe 2 ½ times!”
• Phrases about what something is called
  – “… a colorless liquid called plasma.”
• Omitted words
  – “… the other two [cavities] are called ventricles …”
• Adverbs, complex phrases, and other minor issues
Grade-school Biology Semantics
• Analyzed sample grade-school biology texts about the heart and circulation
• What commonsense knowledge is needed to correctly understand the text?
  – What pump-primed models would be needed?
  – What underlying knowledge could come from bootstrapping?
    • As from tuple extraction from general texts
• Rhetorical question – skip “Did you know that”
• “your heart” = a person’s heart (anatomy context)
• “strongest muscle” [in same body] (anatomy context)
• Build in pragmatics of reading for an anatomy context
• Knowledge: basic anatomy (bootstrapped)
• “divided into” = partitioned (word sense for anatomy)
• “two sides” = two compartments (anatomy/container)
• Knowledge:
• Container/compartments model (pump-primed)
• “right side” = [of the heart] (model of left/right parts)
• “pumps blood” = continuous process (anatomy)
• “to your lungs” could mean it fills up the lungs!
• what is “it”? – right side, or blood, or lungs?
• “picks up” = metaphor for absorbs (anatomy context)
• Knowledge:
• Containers, pumps, liquids (pump-primed)
• “left side” = [of the heart]
• “oxygen-soaked blood” – but a liquid is already wet
  – Would like a model of blood cells, soaked in oxygen (fluid)
  – Not provided here, so just assume blood absorbed oxygen
  – Resolves previous sentence: pronoun “it” = blood
• Knowledge:
• model of left/right parts (pump-primed)
• “out” - liquid flow in & out of containers (pump-primed)
• “They” = the two sides of the heart (difficult)
• Rely on discourse pragmatics
• Knowledge:
•“work on their own” vs. “together as a team”
• Doing something alone vs. cooperating in an effort
• “The body’s blood” = all its blood as a single blob
• Knowledge:
•“circulated through” - model of closed fluid circulation
•“1,000 times per day”- model of repeated events per time period
• “five and six thousand” = 5 ≤ x ≤ 6,000?
• Use pragmatics to get: 5,000 ≤ x ≤ 6,000
• “pumped each day” -- by which side? Or both sides?
• Could pose question: How much blood does a body contain? – 5 to 6 quarts (inference needed)
• Knowledge:
• Fluid flow, iteration, time periods
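The range reading and the follow-up inference in the bullets above can be sketched in a few lines. This is an illustration under stated assumptions, not the project's method: the function names are hypothetical, the heuristic (share the multiplier when both bare numbers are smaller than it) is one plausible pragmatic rule, and the unit is assumed to be quarts, as in the "5 to 6 quarts" answer.

```python
def literal_range(low, high, multiplier):
    """Literal reading: only the second number takes the multiplier."""
    return (low, high * multiplier)

def pragmatic_range(low, high, multiplier):
    """Pragmatic reading of 'five and six thousand': share the multiplier
    across both bounds when both bare numbers are smaller than it."""
    if low < multiplier and high < multiplier:
        return (low * multiplier, high * multiplier)
    return literal_range(low, high, multiplier)

def amount_per_cycle(daily_range, cycles_per_day):
    """Inference step: amount pumped per day / circulations per day."""
    return tuple(amount / cycles_per_day for amount in daily_range)

daily = pragmatic_range(5, 6, 1000)     # assumed unit: quarts pumped per day
body = amount_per_cycle(daily, 1000)    # 1,000 circulations per day
```

Here `body` answers the posed question "How much blood does a body contain?", which requires both the pragmatic range and the per-day circulation count.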
• “your fist” -- interesting object, involves a pose
• Knowledge:
•“about the same size as” – model of comparative sizes
Summary of Biology Semantics
• Pragmatics for an anatomy context
• Pump-primed models:
  – Container & compartments & left/right parts
  – Continuously repeated biological events
  – Pumps & liquids & closed circulation
  – Working together vs. alone
  – Body parts in poses & comparative sizes
• Bootstrapped models:
  – Basic anatomy
• Some difficult pronoun resolutions
Grade-School Biology Conclusions
• Lots of pump-primed knowledge needed
• Bootstrapped knowledge can help
• Even grade-school texts have significant challenges
• Pragmatics need to be built in to NLP engine
• Is still substantially easier than AP chemistry!
Dimensions of Difficulty
[Chart: “Complexity of Knowledge” plotted against “Educational Level of Text” (Elementary, Grade-school, College), locating Grade-school biology and AP Chemistry]
Two Dimensions of Difficulty
Dimension 1: Domain
• Chemistry (hardest)
  – Algebraic manipulation, chaining, procedures
  – Not so much “common sense”
• Physics
  – Map situations onto a few equations
• Biology (easiest)
  – Memorize and compare structures and functions

Two Dimensions of Difficulty
Dimension 2: Educational Level
• College level (hardest):
  – Sophisticated writing styles
  – Often includes mathematical abstractions
  – Attempts to challenge the student
  – Problem-solving
• Grade-school level (easier):
  – Simpler sentence structures
  – Teaches common world knowledge
  – No/little mathematics
  – Learning basic facts
II: Integrating Knowledge
Knowledge Integration: Principles for an Extensible KB
• The Halo KB was not easily extensible
• What should it have looked like?
Five Principles for an Extensible KB
1. Need Metonymy-Tolerant Repns
The precision that logic requires of our written representations is a fundamental barrier to robustness
IF “the acid on the left” is stronger than “the acid on the right”
THEN the reaction direction is “to the right”
where “the acid on the left” stands for “the acid denoted by the formula on the left side of the equation of the reaction”
• Alternative:
  – Preserve metonymy in the KB
  – Have it resolved at reasoning time
1. Metonymy-Tolerant Repns (cont)
(every Compare-Relative-Strengths-of-Acids has
  (output ((if (((the1 of (the value of (the intensity of
                  (the Acid-Role plays of (the first of (the input of Self)))))) = *strong)
                and
                ((the1 of (the value of (the intensity of
                  (the Acid-Role plays of (the second of (the input of Self)))))) /= *strong))
            then (the first of (the input of Self)))))

If we had a metonymy-tolerant reasoner, we could instead write…
(every Compare-Relative-Strengths-of-Acids has
  (output ((if ((the intensity of (the first of (the Chemicals)) = *strong)
            and ((the intensity of (the second of (the Chemicals)) /= *strong)
            then (the strongest of (the Chemicals)) = (the first of (the Chemicals)))))
1. Metonymy-tolerance: Need Background Knowledge!
• Mixing chemical, molecular, and formula views
• Need background knowledge to untangle the mess
[Diagram: the equation text “HC2H3O2(aq) + … C2H3O2-”, annotated with “formula” and “basic-unit” links]
Note the fluidity of reference in written English!
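One way to picture "preserve metonymy in the KB, resolve it at reasoning time" is a lookup that coerces a loose reference to whichever view a property is defined on. The sketch below is an assumption-laden illustration, not KM or the Halo KB: the view table, the `-Molecule` names, and all function names are hypothetical; only the `-Substance` names and strong/weak values follow the slides.

```python
# Hypothetical background knowledge: three views of the same entity.
VIEWS = [
    {"formula": "HCl", "molecule": "HCl-Molecule", "substance": "HCl-Substance"},
    {"formula": "HF", "molecule": "HF-Molecule", "substance": "HF-Substance"},
]

# Which view each property is actually stored on (assumed).
PROPERTY_VIEW = {"intensity": "substance"}

FACTS = {
    ("intensity", "HCl-Substance"): "strong",
    ("intensity", "HF-Substance"): "weak",
}

def coerce(referent, wanted_view):
    """Resolve a loose reference (formula, molecule, or substance)."""
    for views in VIEWS:
        if referent in views.values():
            return views[wanted_view]
    raise KeyError(f"no background knowledge about {referent!r}")

def lookup(prop, referent):
    """Fetch a property, coercing the referent at reasoning time."""
    return FACTS[(prop, coerce(referent, PROPERTY_VIEW[prop]))]

def stronger_acid(first, second):
    """The comparison rule, written against loose references."""
    if lookup("intensity", first) == "strong" and lookup("intensity", second) != "strong":
        return first
    return None
```

The rule body never mentions roles, molecules, or formulae; the untangling lives in `coerce`, which is exactly where the background knowledge is needed.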
2. Need to Separate Declarative and Procedural Knowledge
Procedural (Conjugate-Acid calculation):
  input: a Base-Chemical
  output: convert Chemical → Molecule → Formula, append “H”, then → Molecule’ → Acid-Chemical
Declarative:
  Acid-Chemical = Base-Chemical + H
  + constraint reasoner to solve constraints
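The declarative form can be read as a single constraint that is solved for whichever side is unknown. The sketch below illustrates that idea only; it is not a general constraint reasoner. It assumes a species is represented as element counts plus a charge, and it treats the added "H" as a proton (charge +1), following the earlier "An acid = a base + a proton". All names are hypothetical.

```python
from collections import Counter

def species(charge=0, **atoms):
    """Represent a chemical species as (element counts, charge)."""
    return (Counter(atoms), charge)

def add_proton(sp):
    atoms, charge = sp
    return (atoms + Counter(H=1), charge + 1)

def remove_proton(sp):
    atoms, charge = sp
    if atoms["H"] == 0:
        raise ValueError("species has no hydrogen to give up")
    return (atoms - Counter(H=1), charge - 1)

def solve_conjugate_pair(acid=None, base=None):
    """Single constraint: acid = base + proton.
    Solve for the missing side, or check the pair if both are given."""
    if acid is None and base is None:
        raise ValueError("need at least one side of the pair")
    if acid is None:
        return add_proton(base), base
    if base is None:
        return acid, remove_proton(acid)
    if add_proton(base) != acid:
        raise ValueError("not a conjugate pair")
    return acid, base

hcl = species(H=1, Cl=1)
chloride = species(charge=-1, Cl=1)
```

The same relation answers "what is the conjugate base of HCl?" and "what is the conjugate acid of Cl-?", which the one-directional procedure cannot do without a second procedure.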
2. Need to Separate Declarative and Procedural Knowledge (cont)
“Every acid has a conjugate base, formed by removing a proton from the acid. ... Similarly, every base has associated with it a conjugate acid, formed by adding a proton to the base.”
Acid-Chemical = Base-Chemical + H
The English text often doesn’t help…
3. Syntactic Organization Matters!
• Elaboration tolerance:
  – Add/modify knowledge (semantics) by (only) adding formulae (syntactics)

(every Acid-Role has
  (intensity (
    (a Intensity-Value with
      (value (
        (:pair ;; Case statement for Acids.
          (if ((the played-by of Self) isa Ionic-Compound-Substance)
           then (if (((the played-by of Self) isa HCl-Substance) or
                     ((the played-by of Self) isa HBr-Substance) or
                     ((the played-by of Self) isa HI-Substance) or
                     ((the played-by of Self) isa HClO3-Substance) or
                     ((the played-by of Self) isa HClO4-Substance) or
                     ((the played-by of Self) isa H2SO4-Substance) or
                     ((the played-by of Self) isa HNO3-Substance))
                 then *strong
                 else

Not elaboration-tolerant
3. Syntactic Organization Matters!
• Better….
intensity(HCl-Substance, *strong)
intensity(HBr-Substance, *strong)
intensity(HI-Substance, *strong)
intensity(HClO3-Substance, *strong)
intensity(HClO4-Substance, *strong)
intensity(H2SO4-Substance, *strong)
intensity(HNO3-Substance, *strong)
…
intensity(HF-Substance, *weak)
intensity(HC2H3O2-Substance, *weak)
intensity(H2CO3-Substance, *weak)
…
Elaboration-tolerant
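The contrast can be made concrete with a fact table: new knowledge arrives as a new entry, and no existing rule text is edited. A minimal sketch, assuming a plain dictionary stands in for the assertion store; the function names are hypothetical, and the facts are the ones listed on the slide.

```python
# One independent assertion per substance, mirroring the slide's facts.
INTENSITY = {
    "HCl-Substance": "strong",
    "HBr-Substance": "strong",
    "HI-Substance": "strong",
    "HClO3-Substance": "strong",
    "HClO4-Substance": "strong",
    "H2SO4-Substance": "strong",
    "HNO3-Substance": "strong",
    "HF-Substance": "weak",
    "HC2H3O2-Substance": "weak",
    "H2CO3-Substance": "weak",
}

def intensity(substance):
    """Look up a substance's intensity; None means 'not known'."""
    return INTENSITY.get(substance)

def assert_intensity(substance, value):
    """Extend the KB by adding one fact; nothing else changes."""
    INTENSITY[substance] = value

# Learning a new fact from text is a single added assertion.
assert_intensity("H3PO4-Substance", "weak")
```

With the nested case statement, the same extension would mean locating the right branch and editing the condition in place; here it is purely additive, and removing a fact (ablation) is equally local.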
4. Use a linguistically motivated ontology
• Key: mapping from English words/phrases to concepts
• Good: Words and concepts match easily:
– HCl-Substance ↔ “HCl”
• Less good: Linguistic concepts are missing
– strong/weak/negligible↔“HCl is stronger than H2O”
• Even worse: Different conceptual view in the KB
– Direction of equilibrium:
  • Attached to reaction, not eqn, in KB
5. Need Error-Tolerant Reasoning
• KM can go belly-up with a contradiction
• Instead, we need to detect and correct contradictions
  – Detect:
    • explore (ruminate), not just myopic backchaining
    • richer background knowledge
  – Correct:
    • reasoner supports suspension of assumptions/rules (TMS?)
    • search mechanism to control this
Knowledge Mining
Schubert’s Conjecture:
There is a largely untapped source of general knowledge in texts, lying at a level beneath the explicit assertional content, and which can be harnessed.
“The camouflaged helicopter landed near the embassy.”
  helicopters can land
  helicopters can be camouflaged
Our attempt: “lightweight” LFs generated from Reuters
LF forms:
  (S subject verb object (prep noun) (prep noun) …)
  (NN noun … noun)
  (AN adj noun)
Knowledge Mining
Newswire Article:
HUTCHINSON SEES HIGHER PAYOUT. HONG KONG. Mar 2.
Li said Hong Kong’s property market remains strong while its economy is performing better than forecast. Hong Kong Electric reorganized and will spin off its non-electricity related activities. Hongkong Electric shareholders will receive one share in the new subsidiary for every owned share in the sold company. Li said the decision to spin off …
Implicit, tacit knowledge:
  Shareholders may receive shares.
  Companies may be sold.
  Shares may be owned.
Knowledge Mining – our attempt
;; Atoms can combine
(S "atom" "combine")

;; For example, combustion reactions are redox reactions because elemental oxygen is converted to compounds of oxygen (Section 3.2).
(S "reaction" "be" "reaction")
(S-ADJ "oxygen" "converted" ("to" "compound"))
(AN "elemental" "oxygen")

;; Plan: Metals react with acids to form salts and gas.
(S "metal" "react" (PP "with" "acid"))

;; Extensive oxidation can lead to the failure of metal machinery parts or the deterioration of metal structures.
(S "oxidation" "lead" (PP "to" "failure"))
(S "oxidation" "lead" (PP "to" "deterioration"))
(AN "extensive" "oxidation")
Fragment of the raw data (Brown & Lemay)
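To show how tuples like these relate to the "X can Y" statements of Schubert's conjecture, the sketch below turns S and AN tuples back into generic sentences. It is an illustration only, not the project's extraction code: the Python tuple encoding, the function names, and the naive pluralizer are assumptions, and mass nouns such as "oxidation" will come out awkwardly.

```python
def pluralize(noun):
    """Naive English plural; adequate only for simple count nouns."""
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    if noun.endswith(("s", "x", "ch", "sh")):
        return noun + "es"
    return noun + "s"

def verbalize(tup):
    """Render an S or AN tuple as a generic 'can' statement."""
    kind = tup[0]
    if kind == "S":
        subject, verb, *rest = tup[1:]
        words = [pluralize(subject).capitalize(), "can", verb]
        for item in rest:
            if isinstance(item, tuple):
                # Prepositional phrase, e.g. ("PP", "with", "acid").
                words += [item[-2], pluralize(item[-1])]
            else:
                # Plain object noun.
                words.append(pluralize(item))
        return " ".join(words) + "."
    if kind == "AN":
        adjective, noun = tup[1:]
        return f"{pluralize(noun).capitalize()} can be {adjective}."
    raise ValueError(f"unsupported tuple type: {kind!r}")

print(verbalize(("S", "atom", "combine")))
print(verbalize(("S", "metal", "react", ("PP", "with", "acid"))))
print(verbalize(("AN", "camouflaged", "helicopter")))
```

The last call reproduces the conjecture's own example, "Helicopters can be camouflaged."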
Summary: Overall Findings and Products
• CPL: two formulations
  – "naive CPL": 275 sentences
  – rule-language CPL: ~15 complex rules
• CPL language interpretation algorithm
• Understanding Language
  – Characterization and quantification of the main challenges
  – Detailed case studies on the five pages
• Integrating Knowledge
  – Characterization of the main challenges
  – Set of principles for overcoming them
  – Study and algorithms for some of them
• Bridging the Gap: Useful conceptual framework
• Text Mining
  – 2 tuple databases: 15k chemistry, 25k biology
Summary: Recommendations for Mobius
• Significant work needed on
  – math/symbol manipulation
  – handling generics
  – idiomatic words/phrases
  – Loosespeak
• Cycle, not just bottom-up/top-down!
• Discourse structure needs to be taken seriously
  – Not just individual sentences
• Need some radical KB changes
  – extensible units of knowledge, not intertwined structures
  – Error-tolerant/Robust reasoning