23 January 2003 APAN-Fukuoka
Language and Tools for Lexical Resource Management
Asanee Kawtrakul (1), Aree Thunkijjanukij (2),
Preeda Lertpongwipusana (1), Poonna Yospanya (1)
(1) Department of Computer Engineering, Faculty of Engineering; (2) Thai National AGRIS Center
Kasetsart University
Acknowledgement
• JIRCUS: Japan International Research Center for Agricultural Sciences
• Organizing committee
• Kasetsart University
Outline
• Background & Motivation
• Problems in Lexical Resource Preparation
• Requirements for Lexical Resource Management
• Proposed Language and tools
• Conclusion and Next steps
Background and Motivation
• Thailand is an agriculture-based country with rich knowledge and data in the agricultural field
• A great quantity of agricultural information is scattered across unstructured and unrelated texts
– Skimming, digesting and integrating become essential
• Knowledge is spread around the world
– Knowledge discovery without language barriers is also needed
The Basic Idea behind..
[Figure: system components: Internet, Gathering Module, Agricultural Document Collection, Indexing and Clustering Module, Summarization Module, Translation Module, Data Cube, Graphical User Interface]
Textual Data as an Input
Let us focus on Canada’s agricultural products. In 1998, there were 1,216 registered commercial egg producers in Canada. Ontario produced 39.8% of all eggs in Canada, Quebec was second with 16.6%. The western provinces have a combined egg production of 35.6% and the eastern provinces have a combined production of 8.0%.
Courtesy of Agriculture and Agri-Food Canada, http://www.agr.ca/cb
Summarization and Translation as a Result
Category   Exporter   Year   Month      Price   Unit
Paddy      Thailand   2002   January    300     Dollars/Ton
Paddy      Thailand   2002   February   285     Dollars/Ton

ประเภท       ผู้ส่งออก      ปี      เดือน        ราคา     หน่วย
ข้าวเปลือก    ประเทศไทย   2545   มกราคม     14,340   บาทต่อเกวียน
ข้าวเปลือก    ประเทศไทย   2545   กุมภาพันธ์   13,625   บาทต่อเกวียน

(Thai table, in English: Category, Exporter, Year, Month, Price, Unit; paddy, Thailand, Buddhist Era 2545, January and February, baht per kwian)
The Development of Agricultural System for Knowledge Acquisition and Dissemination
• 5-year project (2001-2005)
• Collaborative work between:
– Thai National AGRIS Center: providing the bilingual thesaurus (AGROVOC)
– Department of Computer Engineering: developing NLP techniques for searching, summarizing and translation, including tools for lexical resource management
• Funded by Kasetsart University Research and Development Institution
Acquisition System
[Figure: system architecture. Components shown: Linguist/Domain Expert, Very Large Corpus, Linguistic Knowledge Base (Rules, Thesaurus, Lexicon), Document Indexing & Clustering, Intelligent Search Engine (with Translation, with Summarization), Document Warehouse, Gathering Module, Internet/Intranet]
Thai Agricultural Thesaurus
• The English vocabulary totals 27,531 terms
• Only 10,280 terms have been translated into Thai (scientific names excepted)
• Scientific names were not translated
– e.g. Oryza (genus) sativa (species) for rice, or family names
Problems in a Hand-Coded Thesaurus
• Scalability
• Reliability and Coherence
• Rigidity
• Cost
[Figure: fragments of a hand-coded thesaurus hierarchy]
• Foods: Bakery Product, Dietetic Foods, Frozen Foods, Fermented Foods
• Processed Products: Canned Products, Dried Products, Frozen Products, Fermented Products
• Other nodes shown: Alcoholic Beverage, milk, Local Product, Products
• "Fermented Fish" appears several times, under Foods / Fermented Foods, under Processed Products and under Local Product
Commercial Vegetables: The September index, at 107, was up 1.9 percent from last month but 3.6 percent below September 1998. Price increases for lettuce, tomatoes, broccoli, and celery more than offset price decreases for onions, carrots, and cucumbers.
[Figure: three views of the term "tomatoes"]
Keywords found in the text
• Commercial Vegetable: tomatoes, broccoli, carrots, cucumbers
Expert Domain (thesaurus entries around "tomatoes")
• BT: VEGETABLES
– BROCCOLI: type=leaf vegetable, color=green
– SWEET PEPPER: type=fruit vegetable, color=red, green, yellow
– TOMATOES: type=fruit vegetable, color=red, yellow
• NT: CHERRY TOMATOES: type=fruit vegetable, color=red
• RT: LYCOPERSICON ESCULENTUM: type=taxonomic
– BT: SOLANACEAE
– NT of SOLANACEAE: CAPSICUM, NICOTIANA
Keyword Assigned
• tomato, tomatoes
User Category
• Commercial Vegetable: broccoli, carrot, tomato
Other Major Problems (1)
• Accessing textual information
– Language variation: there are many ways to express the same idea
Ex: "thinning flower" USE "deblossoming"
    "thinning branch" USE "pruning"
– How can the computer know that the words a person uses are related to the words found in stored text?
Ex: user: thinning branch
    computer: pruning
Requirement (1)
• Accessing textual information
– Need intelligent browsing from related concept to related concept, rather than from occurrences of stemmed character strings
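The requirement above can be illustrated with a minimal Python sketch. It assumes a tiny hand-written table of USE relations taken from the two examples on the previous slide; the document index, the document identifiers and the function names are hypothetical and only serve to show a lookup by concept rather than by character string.

```python
# USE relations from the slide: non-preferred term -> preferred term.
USE = {
    "thinning flower": "deblossoming",
    "thinning branch": "pruning",
}

# Hypothetical index: preferred term (concept) -> document identifiers.
INDEX = {
    "pruning": ["doc-12"],
    "deblossoming": ["doc-7"],
}


def to_concept(term: str) -> str:
    """Normalize case and spacing, then follow the USE relation if one exists."""
    key = " ".join(term.lower().split())
    return USE.get(key, key)


def search(query: str) -> list:
    """Look up documents by concept instead of by the literal query string."""
    return INDEX.get(to_concept(query), [])


user_query = "thinning branch"
concept = to_concept(user_query)   # the term the computer should search for
hits = search(user_query)          # documents indexed under that concept
```

A plain string match on "thinning branch" would miss documents indexed under "pruning"; routing the query through the thesaurus first is what lets the two vocabularies meet.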
Other Major Problems (2)
• Transforming from unstructured to structured information
Requirement (2)
• Need an application-based frame about product price
– Knowledge representation in table form
– Consisting of attributes and their values
Attributes   Values
Category     Paddy
Exporter     Thailand
Price        300
Unit         Dollars/Ton
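As a rough illustration of such a frame, the attribute/value table above could be held in a small typed record. This is only a sketch: the class name `PriceFrame` and the helper `as_rows` are assumptions made for the example, not part of the proposed system.

```python
from dataclasses import dataclass, asdict


@dataclass
class PriceFrame:
    """Hypothetical frame for one product-price fact extracted from text."""
    category: str
    exporter: str
    price: float
    unit: str


def as_rows(frame: PriceFrame) -> list:
    """Return the frame as (attribute, value) pairs, i.e. the table form."""
    return list(asdict(frame).items())


# The values shown on the slide.
frame = PriceFrame(category="Paddy", exporter="Thailand", price=300, unit="Dollars/Ton")
rows = as_rows(frame)
```

Keeping the attributes fixed and typed is what makes the extracted facts comparable across documents and, later, across languages.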
Problems in Translation: Pragmatic and Semantic
• The September All Farm Products Index was 97 percent of its 1990-92 base, down 1.0 percent from the August index and 2.0 percent below the September 1998 Index
[Figure: interpreting the sentence using an ontology]
• "97 percent of its 1990-92 base": 0.97 * average price of the years 1990-1992
• "September": of year ??
• "August": year 1997
• "2.0 percent below": down 0.02 * price(September 1998)
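The arithmetic implied by the sentence can be worked through in a few lines of Python. This sketch assumes that "down 1.0 percent from the August index" and "2.0 percent below the September 1998 index" are both relative changes measured against the earlier value; the function name is hypothetical.

```python
def reference_value(current: float, percent_below: float) -> float:
    """Value that `current` is `percent_below` percent below."""
    return current / (1 - percent_below / 100)


# Index expressed as a percentage of the 1990-92 base,
# i.e. 0.97 * the average price of 1990-1992.
september_index = 97.0

# Implied earlier index values, under the assumption stated above.
august_index = reference_value(september_index, 1.0)
september_1998_index = reference_value(september_index, 2.0)
```

Under that reading the August index is about 97.98 and the September 1998 index about 98.98. A translator without this background knowledge cannot tell which quantity each percentage refers to, which is the point of the slide.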
“Year 1990-1992” meaning
Product Year
A 1990 1991 1992
B - - -
C - - -
D - - -
Requirement (3)
• The lexicon should have semantic constraints between lexical entities and restrictions on usage categories
Summary of Problems related to lexicon
• In terms of coverage
– Extensional coverage, i.e., the number of entries
– Intensional coverage, i.e., the number of information fields
• In terms of the semantic domain covered by the application
– Meaning interpretation with respect to objects, subject matter, topics of discourse, and pragmatic interpretation
• The user category with reference to the intended system users
– Commercial products vs. plant products vs. family products
One Solution
• Encoding world knowledge in the structures attached to each lexical item, which needs both a language and tools
The Design of Lexicon: Requirement Specification
• Macrostructure: lexicon structure in terms of relations between lexical entries
– i.e. hierarchical taxonomies, which are characteristic of thesauri of semantically related word families
• Microstructure: types of information for each entry
– Pronunciation or phonemic transcription
– Syntactic properties
– Meaning
– Pragmatics of their use in real context and language
Microstructure (cont’)
• A lexical entity could contain slots/scripts for each specific domain, and needs an intelligent analyzer for language understanding that:
– Supports information extraction
– Supplies the missing values
Lexical Resource Management Language
• A language that is able to:
– Handle heterogeneity of linguistic knowledge structures.
– Handle exceptions and inconsistencies of natural languages.
– Provide an intuitive means to store and manipulate both linguistic and world knowledge.
Language Features
• The language is designed in a way that will enable:
– Support for heterogeneous structures.
– Sufficient provisions to handle exceptions and inconsistencies of natural languages (this is achieved through the +/- operators).
– Deduction of knowledge from rules.
– Detection and prevention of potential integrity violations.
Language and Tools Specification requirement
• Flexibility – almost any structure can be defined in this model.
• Extensibility – extending a structure is simple.
• Maturability – structure reformation and deformation are supported.
• Integrity – meta-relations help prevent malformed or semantically ill-formed data entries.
• Dealing with inconsistencies is feasible.
Some Syntactic Elements
• Knowledge manipulations are achieved through these primitives:
– def is used to define structures not already existing.
– redef changes aspects of existing structures.
– undef removes specified structures from the knowledge base.
– ret is used to retrieve structures from the knowledge base.
Examples
• Hierarchies: tree structures representing generalization semantics, or classes, of atoms.
[Figure: a semantic tree represented by a hierarchy structure. Root: thing; children of thing: animate, inanimate; children of animate: human, animal]
Usage Examples
• Defining a hierarchy
– def thing(animate(human+animal)+inanimate).
• Adding the ‘plant’ and ‘vehicle’ concepts
– def animate(plant+vehicle).
• Reparenting the ‘vehicle’ concept
– redef animate(vehicle) inanimate(vehicle).
• Removing the ‘human’ concept
– undef human. (provided that there is only a single instance of ‘human’)
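To make the effect of these four statements concrete, here is a small Python sketch that mimics them on an in-memory tree. It is not the language's implementation: the class and method names are invented for illustration, and the sketch assumes that each concept name occurs only once and has a single parent.

```python
from typing import Dict, List, Optional


class Hierarchy:
    """Toy single-parent concept tree; concept names are assumed unique."""

    def __init__(self, root: str) -> None:
        self.parent: Dict[str, Optional[str]] = {root: None}

    def define(self, parent: str, *children: str) -> None:
        # Rough analogue of: def parent(child1+child2).
        if parent not in self.parent:
            raise KeyError(f"unknown parent: {parent}")
        for child in children:
            if child in self.parent:
                raise ValueError(f"already defined: {child}")
            self.parent[child] = parent

    def ancestors(self, name: str) -> List[str]:
        chain = []
        node = self.parent[name]
        while node is not None:
            chain.append(node)
            node = self.parent[node]
        return chain

    def redefine(self, child: str, new_parent: str) -> None:
        # Rough analogue of: redef old(child) new(child).
        if child not in self.parent or new_parent not in self.parent:
            raise KeyError("unknown concept")
        if new_parent == child or child in self.ancestors(new_parent):
            raise ValueError("reparenting would create a cycle")
        self.parent[child] = new_parent

    def undefine(self, name: str) -> None:
        # Rough analogue of: undef name. Only leaves are removed in this sketch.
        if any(p == name for p in self.parent.values()):
            raise ValueError(f"{name} still has children")
        del self.parent[name]


tree = Hierarchy("thing")
tree.define("thing", "animate", "inanimate")
tree.define("animate", "human", "animal")
tree.define("animate", "plant", "vehicle")   # adding 'plant' and 'vehicle'
tree.redefine("vehicle", "inanimate")        # reparenting 'vehicle'
tree.undefine("human")                       # removing 'human'
```

After these calls, `tree.ancestors("vehicle")` walks up through 'inanimate' to 'thing', and 'human' is no longer in the tree.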
Usage Examples (2)
• Defining case frames for verbs
– First, we need to define meta-relations for words belonging to the sub-hierarchy ‘verb’.
– def meta case(verb, sub:thing).
– def meta case(verb, sub:thing, obj:thing).
– Then, we define case frames for several verbs.
– def case(eat, sub:human+animal, obj:food).
– def case(fly, sub:bird-penguin). (here, we emphasize the use of +/- operators)
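One way to read the +/- operators is as inclusion and exclusion of sub-hierarchies when checking a role filler. The Python sketch below follows that reading; it is an assumption about the semantics, not the language's definition. The concept tree is hypothetical ('sparrow' and the placement of 'bird' under 'animal' are invented for the example).

```python
# Hypothetical concept tree: child -> parent.
PARENT = {
    "thing": None,
    "animate": "thing", "human": "animate", "animal": "animate",
    "bird": "animal", "penguin": "bird", "sparrow": "bird",
    "food": "thing",
}

# Case frames as (included concepts, excluded concepts) per role.
# eat: sub:human+animal, obj:food      fly: sub:bird-penguin
CASE = {
    "eat": {"sub": (("human", "animal"), ()), "obj": (("food",), ())},
    "fly": {"sub": (("bird",), ("penguin",))},
}


def is_a(concept, ancestor):
    """True if `concept` equals `ancestor` or lies below it in the tree."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def allowed(concept, include, exclude):
    """'+' joins acceptable classes; '-' carves exceptions out of them."""
    included = any(is_a(concept, c) for c in include)
    excluded = any(is_a(concept, c) for c in exclude)
    return included and not excluded


def check(verb, **roles):
    """Check every supplied role filler against the verb's case frame."""
    frame = CASE[verb]
    return all(
        role in frame and allowed(filler, *frame[role])
        for role, filler in roles.items()
    )


sparrow_flies = check("fly", sub="sparrow")
penguin_flies = check("fly", sub="penguin")
```

Here a sparrow is accepted as the subject of 'fly' because it is a bird, while a penguin is rejected by the exception even though it is also a bird.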
Hierarchy & Set
[Figure: a hierarchy containing c1, c2, p1 and w1 to w7, shown alongside a set c3 with members f1, f2, f3 and f4]
Defining a Hierarchy
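[Figure: hierarchy rooted at c1. Children of c1: w1 (with child w3), c2 (with child w4, which has children w5 and w6) and w2 (with child p1, which has child w7)]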
def c1(“w1”(“w3”)+c2(“w4”)+“w2”).
def “w5”+“w6” under “w4”.
def “p1”(“w7”) under “w2”.
Manipulating the Hierarchy
[Figure: the hierarchy rooted at c1, annotated with the two operations below]
redef “w4” under “w2”.
undef “w1”.
Defining a Set
[Figure: set c3 with members f1, f2, f3 and f4]
def c3{[f1]+[f2]+[f3]}.
def [f4] in c3.
Defining a Relation
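def meta r1(c2, c3).
Template defined.
def r1(“w4”, [f1]).
Relation defined.
def r1(“w1”, [f3]).
Constraint violated. Definition not allowed.
[Figure: template r1’ between hierarchy node c2 (with w4, w5, w6) and set c3 (f1, f2, f3, f4); relation r1 between “w4” and [f1]; labels: r1’, r1, inherited]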
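The three responses above can be reproduced with a short Python sketch of template checking. It assumes the hierarchy as given on the "Defining a Hierarchy" slide (so “w1” sits directly under c1, outside c2) and assumes that a relation is accepted only when its first argument lies under the template's hierarchy node and its second argument is a member of the template's set. Function and variable names are hypothetical.

```python
# Hierarchy from the "Defining a Hierarchy" slide (child -> parent).
PARENT = {
    "c1": None,
    "w1": "c1", "w3": "w1",
    "c2": "c1", "w4": "c2", "w5": "w4", "w6": "w4",
    "w2": "c1", "p1": "w2", "w7": "p1",
}
SETS = {"c3": {"f1", "f2", "f3", "f4"}}
TEMPLATES = {}     # relation name -> (hierarchy node, set name)
RELATIONS = set()  # accepted (relation name, word, feature) triples


def is_under(node, ancestor):
    """True if `ancestor` is a proper ancestor of `node`."""
    current = PARENT.get(node)
    while current is not None:
        if current == ancestor:
            return True
        current = PARENT.get(current)
    return False


def def_meta(name, hierarchy_node, set_name):
    """Analogue of: def meta r1(c2, c3)."""
    TEMPLATES[name] = (hierarchy_node, set_name)
    return "Template defined."


def def_relation(name, word, feature):
    """Analogue of: def r1("w4", [f1]). Checked against the template."""
    hierarchy_node, set_name = TEMPLATES[name]
    if not is_under(word, hierarchy_node) or feature not in SETS[set_name]:
        return "Constraint violated. Definition not allowed."
    RELATIONS.add((name, word, feature))
    return "Relation defined."


msg_meta = def_meta("r1", "c2", "c3")
msg_ok = def_relation("r1", "w4", "f1")
msg_bad = def_relation("r1", "w1", "f3")
```

The template acts like a type signature for the relation: “w4” is accepted because it lies under c2, while “w1” is rejected before anything is stored.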
Synset & Surrogates
• A synset is an unnamed set identified by its unique ID.
• Members of a synset are considered synonymous, with different degrees of synonymity.
• A distance graph is automatically constructed within a synset, with surrogates being representatives of synset members.
• Entities with identical features are attached to the same surrogate.
Synset & Surrogates
[Figure: synset#1. Members w1, w2, w3, w4, w6, p1, p2 and p3 are attached to surrogates s1 to s5 according to their features (f1 to f4); the surrogate network is constructed internally]
Synset & Multilingual Lexicon
• Synset members are not confined within a language scope; that is, entities from different languages may belong to the same synset.
• The distance matrix is computed from the number of different features over each pair of surrogates.
• Traversing from a word to its nearest words is handled by the system. We can determine the words with potentially nearest semantics here.
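The distance computation described above can be sketched in Python as follows. The feature sets and the word-to-surrogate assignments are invented for illustration, and "number of different features" is interpreted here as the size of the symmetric difference between two feature sets, which is an assumption.

```python
# Hypothetical surrogates and their feature sets.
SURROGATES = {
    "s1": frozenset({"f1", "f2"}),
    "s2": frozenset({"f1", "f3"}),
    "s3": frozenset({"f1", "f2", "f4"}),
}

# Hypothetical synset members (possibly from different languages),
# each attached to the surrogate that carries its features.
MEMBERS = {"w1": "s1", "w2": "s1", "w3": "s2", "w4": "s3"}


def distance(a, b):
    """Number of features present in one surrogate but not the other."""
    return len(SURROGATES[a] ^ SURROGATES[b])


def distance_matrix():
    """Distance for every ordered pair of surrogates."""
    names = sorted(SURROGATES)
    return {(a, b): distance(a, b) for a in names for b in names}


def nearest_words(word):
    """Other synset members at the smallest surrogate distance from `word`."""
    home = MEMBERS[word]
    others = {w: distance(home, s) for w, s in MEMBERS.items() if w != word}
    best = min(others.values())
    return sorted(w for w, d in others.items() if d == best)


matrix = distance_matrix()
closest_to_w1 = nearest_words("w1")
closest_to_w3 = nearest_words("w3")
```

Words that share a surrogate are at distance 0 from each other, so in this toy data the nearest word to w1 is w2; for a multilingual synset the same lookup would give the closest candidate in another language.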
Expected Result
[Figure sequence: keywords generated for "tomatoes" by traversing the Expert Domain thesaurus (BT: VEGETABLES, with BROCCOLI, SWEET PEPPER and TOMATOES; NT: CHERRY TOMATOES; RT: LYCOPERSICON ESCULENTUM, BT SOLANACEAE, with CAPSICUM and NICOTIANA)]
• Request: “Fruit vegetable”, red
– Keywords generated: Sweet pepper, Tomatoes, Cherry Tomatoes
• Request: “Plant in same family”
– Keywords generated: Capsicum, Nicotiana
• Keyword assigned: tomato; user category: Commercial Vegetable (broccoli, carrot, tomato)
– Keywords generated: tomato, Tomato, Tomatoes, Cherry Tomatoes
Conclusion and Next steps
• This is a preliminary introduction to the language, showing a few of its many possibilities.
• Structures not mentioned in detail here have not yet been firmly specified. These structures are rules, maps, and contexts, which are incorporated to extend the potential for handling deductions, multilingual operations, domain-dependent retrievals, etc.
Next Steps
• Revise the idea
• Continue the implementation
– Aligner tool
– GUI tools for thesaurus maintenance
• Short-term: solutions to language variability problems by exploiting available knowledge sources with available techniques
• Long-range: the approach needs high-quality language understanding, i.e., automatic thesaurus construction
– System of Agricultural Information Summarization and Translation
Thank you