ONTOLOGY LEARNING AND POPULATION FROM TEXT, Ch. 8: Population


Population
• Population of an ontology:
  • Finding instances of relations as well as of concepts
  • Requires full understanding of natural language
• More modest target:
  • The extraction of a set of predefined relations
• In this chapter:
  • No acquisition of instances of relations
  • The detection of instances of concepts

Population
• Common Approaches
  • Corpus-based Population
    • A standard similarity-based approach
  • Learning by Googling
    • Semi-supervised approach
    • PANKOW
    • C-PANKOW

Common Approaches
• Lexico-syntactic Patterns
  • Hearst patterns
• Similarity-based Classification
  • Algorithm 12
  • Data sparseness problem
• Supervised Approaches
  • Predict the category of a given instance with a trained model
  • Require thousands of training examples to train the model
  • Not feasible when hundreds of concepts are possible tags

Similarity-based Classification of Named Entities

• Using different similarity measures
  • Cosine, Jaccard, L1 norm, Jensen-Shannon, Skew
• Using different feature weighting measures
  • Conditional, PMI, Resnik
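The similarity measures named above can be sketched over sparse context vectors (feature → count). This is a minimal sketch using the textbook definitions; the slide does not give the exact formulas used in the chapter, so the function names, the base-2 logarithm, and the skew parameter `alpha=0.99` are assumptions. Note that the L1 norm, Jensen-Shannon, and skew functions return distances or divergences (lower means more similar), while cosine and Jaccard return similarities.

```python
import math


def _normalize(vec):
    """Turn raw counts into a probability distribution."""
    total = sum(vec.values())
    return {k: v / total for k, v in vec.items()} if total else {}


def _kl(p, q):
    """Kullback-Leibler divergence D(p || q); q must cover the support of p."""
    return sum(pv * math.log2(pv / q[k]) for k, pv in p.items() if pv > 0)


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Overlap of the two feature sets, ignoring counts."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0


def l1_distance(a, b):
    p, q = _normalize(a), _normalize(b)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys())


def jensen_shannon(a, b):
    p, q = _normalize(a), _normalize(b)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in p.keys() | q.keys()}
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def skew(a, b, alpha=0.99):
    """Skew divergence D(p || alpha*q + (1-alpha)*p); alpha is an assumed default."""
    p, q = _normalize(a), _normalize(b)
    mix = {k: alpha * q.get(k, 0.0) + (1 - alpha) * p.get(k, 0.0)
           for k in p.keys() | q.keys()}
    return _kl(p, mix)


# Hypothetical context vectors for two named entities.
mopti = {"city": 1, "port": 1, "market": 1}
gao = {"town": 1, "junction": 1, "city": 1}
print(cosine(mopti, gao), jaccard(mopti, gao), jensen_shannon(mopti, gao))
```

The feature weighting measures (conditional probability, PMI, Resnik) would be applied to the counts before any of these functions are called; they are left out of the sketch.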

Evaluation

• Goal: learn a function fs

• fa and fb: specified by two annotators

• Functions as sets:

• Measurement:
  • Precision, recall, F-measure, learning accuracy
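A minimal sketch of the set-based measures, assuming each function (the system's and each annotator's) is represented as a dict from instance to concept and compared as a set of (instance, concept) pairs. The function names and the example assignments are hypothetical, averaging over the two annotators is an assumption, and learning accuracy is omitted because it needs the taxonomy.

```python
def precision_recall_f(system, reference):
    """Compare two instance -> concept assignments as sets of pairs."""
    sys_pairs = set(system.items())
    ref_pairs = set(reference.items())
    correct = len(sys_pairs & ref_pairs)
    precision = correct / len(sys_pairs) if sys_pairs else 0.0
    recall = correct / len(ref_pairs) if ref_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def average_over_annotators(system, references):
    """Average precision, recall, and F-measure over several reference standards."""
    scores = [precision_recall_f(system, ref) for ref in references]
    return tuple(sum(column) / len(scores) for column in zip(*scores))


# Hypothetical assignments, for illustration only.
f_s = {"Mopti": "city", "Niger": "river", "Gao": "country"}
f_a = {"Mopti": "city", "Niger": "river", "Gao": "city", "San": "city"}
f_b = {"Mopti": "city", "Niger": "river", "Gao": "town", "San": "town"}
print(precision_recall_f(f_s, f_a))
print(average_over_annotators(f_s, [f_a, f_b]))
```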

Experiments
• Using Word Windows
  • n words to the left and right of a word of interest
  • Stopwords are excluded, and windows do not cross sentence boundaries
  • Example text: Mopti is the biggest city along the Niger with one of the most vibrant ports and a large bustling market. Mopti has a traditional ambience that other towns seem to have lost. It is also the center of the local tourist industry and suffers from hard-sell overload. The nearby junction towns of Gao and San offer nice views over the Niger's delta.
  • Extracted features:
    Mopti: traditional(1), biggest(1)
    Niger: city(1), delta(1), view(1)
    Gao: San(1), offer(1), town(1), junction(1)
    San: offer(1), view(1), Gao(1), nice(1)
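The word-window features can be sketched as follows. This is a minimal sketch: the tiny stopword list, the regex tokenizer, and the choice to drop stopwords before taking the window are assumptions, and there is no lemmatization, so it is not expected to reproduce the slide's feature lists exactly.

```python
import re
from collections import Counter, defaultdict

# Assumed, deliberately small stopword list; "s" catches the possessive in "Niger's".
STOPWORDS = {"a", "an", "the", "is", "has", "of", "and", "along", "with",
             "over", "that", "to", "it", "s"}


def window_features(text, targets, n=1):
    """Count the n non-stopword neighbours on each side of every target word."""
    features = defaultdict(Counter)
    # Split into sentences first so that a window never crosses a boundary.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        tokens = [t for t in re.findall(r"[A-Za-z]+", sentence)
                  if t.lower() not in STOPWORDS]
        for i, token in enumerate(tokens):
            if token in targets:
                left = tokens[max(0, i - n):i]
                right = tokens[i + 1:i + 1 + n]
                for context_word in left + right:
                    features[token][context_word.lower()] += 1
    return features


text = "Mopti is the biggest city along the Niger. Mopti has a traditional ambience."
for entity, counts in window_features(text, {"Mopti", "Niger"}, n=1).items():
    print(entity, dict(counts))
```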

Experiments
• Result:

Experiments
• Using Pseudo-syntactic Dependencies
  • Object-attribute pairs
  • Example text: Mopti is the biggest city along the Niger with one of the most vibrant ports and a large bustling market. Mopti has a traditional ambience that other towns seem to have lost. It is also the center of the local tourist industry and suffers from hard-sell overload. The nearby junction towns of Gao and San offer nice views over the Niger's delta.
  • Extracted features:
    Mopti: is_city(1), has_ambience(1)
    Niger: has_delta(1)
    Gao: junction_of(1)
    San: offer_subj(1)
  • Result:

Experiments
• Dealing with Data Sparseness
  • Using Conjunctions
    • When two named entities are linked by a conjunction
  • Result:

Experiments
• Dealing with Data Sparseness
  • Exploiting the Taxonomy
    • Compute the context vector of a term by considering the context vectors of its subconcepts
    • Take into account only the context vectors of direct subconcepts
    • Normalizing aggregated vectors:
      • Standard normalization of the vector
      • Calculating its centroid
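A minimal sketch of the aggregation step, assuming context vectors are sparse feature → count mappings and the taxonomy is a dict from concept to its direct subconcepts. Whether the concept's own vector is included in the sum, and reading "standard normalization" as scaling to unit length, are assumptions; all names are hypothetical.

```python
import math
from collections import Counter


def aggregate(concept, own_vectors, children):
    """Sum a concept's own vector with the vectors of its direct subconcepts."""
    total = Counter(own_vectors.get(concept, {}))
    for child in children.get(concept, []):
        total.update(own_vectors.get(child, {}))  # Counter.update adds counts
    return total


def unit_normalize(vec):
    """Assumed reading of 'standard normalization': scale to Euclidean length 1."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    if not vectors:
        return {}
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {k: count / len(vectors) for k, count in total.items()}


# Hypothetical taxonomy fragment and context vectors.
children = {"settlement": ["city", "town"]}
own_vectors = {"city": {"port": 2}, "town": {"market": 1, "port": 1}}
print(unit_normalize(aggregate("settlement", own_vectors, children)))
print(centroid([own_vectors[c] for c in children["settlement"]]))
```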

Experiments
• Dealing with Data Sparseness
  • Exploiting the Taxonomy
  • Result:

Experiments
• Dealing with Data Sparseness
  • Anaphora Resolution
    • Replace each anaphoric reference with the corresponding antecedent
    • Before: The port capital of Vathy is dominated by its fortified Venetian harbor.
    • After: The port capital of Vathy is dominated by Vathy's fortified Venetian harbor.
  • Result:

Experiments
• Dealing with Data Sparseness
  • Downloading Documents from the Web
    • Download 20 additional documents D_i for each named entity i
    • Keep only the documents d whose similarity is above a threshold of 0.2
  • Result:
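The filtering of downloaded documents can be sketched as below. The slide does not say what the similarity is measured against or which measure is used; this sketch assumes cosine similarity between a bag-of-words vector of each downloaded document and a reference vector built from the corpus. The function names and the example texts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-cased word counts; the tokenizer is an assumption."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_documents(reference_vector, documents, threshold=0.2):
    """Keep the documents whose similarity to the reference exceeds the threshold."""
    return [doc for doc in documents
            if cosine(reference_vector, bag_of_words(doc)) > threshold]


reference = bag_of_words("mopti city niger port market")
downloaded = ["Mopti is a city on the Niger", "Stock prices fell sharply today"]
print(filter_documents(reference, downloaded))
```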

Experiments
• Dealing with Data Sparseness
  • Post-processing
    • The k best answers of the system are checked for their statistical plausibility on the Web
  • Result:

PANKOW
• Pattern-based Annotation through Knowledge on the Web
• Certain lexico-syntactic patterns, as defined by Hearst, can be matched not only in a corpus but also on the World Wide Web

PANKOW
• The Process of PANKOW
  • Step 1: Iterate over the set of entities to be classified and generate instances of patterns, one for each concept in the ontology.
    • For example, the instance South Africa and the concepts country and hotel result in the pattern instances "South Africa is a country" and "South Africa is a hotel", or "countries such as South Africa" and "hotels such as South Africa".
    • Result 1: A set of pattern instances
  • Step 2: Google is queried for the pattern instances through its Web service API.
    • Result 2: The counts for each pattern instance
  • Step 3: The query results are summed up to a total for each concept.
    • Result: The statistical web fingerprint for each entity, that is, the result of aggregating, for each entity, the Google counts for all pattern instances conveying the relation of interest.
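The three steps above can be sketched as follows. This is a minimal sketch under stated assumptions: only two patterns are used, plural forms are supplied by hand, and `hit_count` is a hypothetical callable standing in for the web search API (here backed by made-up counts). Picking the concept with the highest total as the final annotation is also an assumed last step.

```python
from collections import Counter


def pattern_instances(instance, concept, plural):
    """Step 1: instantiate each pattern for one instance/concept pair."""
    return [f"{instance} is a {concept}", f"{plural} such as {instance}"]


def web_fingerprint(instance, concepts, hit_count):
    """Steps 2 and 3: query each pattern instance and sum the counts per concept.

    concepts maps each concept to its plural form; hit_count maps a query
    string to a number of hits (hypothetical stand-in for the search API).
    """
    fingerprint = Counter()
    for concept, plural in concepts.items():
        for query in pattern_instances(instance, concept, plural):
            fingerprint[concept] += hit_count(query)
    return fingerprint


def classify(instance, concepts, hit_count):
    """Assumed final step: pick the concept with the largest total count."""
    fingerprint = web_fingerprint(instance, concepts, hit_count)
    return max(fingerprint, key=fingerprint.get) if fingerprint else None


# Made-up counts, for illustration only.
fake_counts = {
    "South Africa is a country": 120,
    "countries such as South Africa": 300,
    "South Africa is a hotel": 2,
}
concepts = {"country": "countries", "hotel": "hotels"}
print(classify("South Africa", concepts, lambda q: fake_counts.get(q, 0)))
```

Each instance costs one query per pattern per concept, which is the query volume that the later C-PANKOW slides identify as a shortcoming.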

PANKOW
• Evaluation
  • From the two annotators:
    • Reference standards for subjects A and B
  • Measurement:
    • Precision, recall, and F-measure
    • Average the results for both annotators

PANKOW
• Result:

C-PANKOW
• Shortcomings of PANKOW
  • Many actual instances of the pattern schema are not found
  • A large number of queries is sent to the Google Web API
  • Does not scale to larger ontologies

C-PANKOW
• C-PANKOW Process
  • The web page to be annotated is scanned for candidate instances.
  • For each instance i discovered and for each clue-pattern pair in the pattern library P, an automatically generated query is issued to Google, and the abstracts or snippets of the first n hits are downloaded.
  • Then the similarity between the document to be annotated and each downloaded abstract is calculated. If the similarity is above a given threshold t, the actual pattern found in the abstract reveals a phrase which may describe the concept that the instance belongs to in the context in question.
  • The pattern matched in a given Google abstract is only considered if the similarity between the original page and this abstract is above the threshold. In this way the pattern-matching process is contextualized.
  • Finally, the instance i is annotated with the concept c that has the largest number of hits as well as the most contextually relevant ones.
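The contextualized matching described above can be sketched as follows. This is a sketch under several assumptions: `search` is a hypothetical callable returning snippet strings for a query, the pattern library holds only two hand-written clue/pattern pairs, similarity is cosine over bag-of-words vectors, and each matching snippet above the threshold counts as one vote. Mapping the extracted phrase to an ontology concept (for example, lemmatizing plurals) is omitted.

```python
import math
import re
from collections import Counter

# Assumed pattern library: (query clue, regex with a named "concept" group).
PATTERN_LIBRARY = [
    ('"{instance} is a"', r"{instance} is an? (?P<concept>[a-z]+)"),
    ('"such as {instance}"', r"(?P<concept>[a-z]+) such as {instance}"),
]


def bag(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def c_pankow(page_text, instance, patterns, search, threshold, n):
    """Return the most frequently matched concept phrase, or None."""
    page_vector = bag(page_text)
    votes = Counter()
    for clue, regex_template in patterns:
        regex = re.compile(regex_template.format(instance=re.escape(instance)),
                           re.IGNORECASE)
        for snippet in search(clue.format(instance=instance))[:n]:
            # Contextualization: skip snippets too dissimilar to the page.
            if cosine(page_vector, bag(snippet)) <= threshold:
                continue
            for match in regex.finditer(snippet):
                votes[match.group("concept").lower()] += 1
    return votes.most_common(1)[0][0] if votes else None
```

A caller would supply the page text, one candidate instance at a time, a real search function, and values for t and n; the slide leaves both values open, so the sketch takes them as required arguments rather than guessing defaults.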

C-PANKOW
• Evaluation
  • Same dataset and evaluation measures as PANKOW
  • But C-PANKOW uses the 682 concepts of the pruned Tourism ontology as possible tags
  • Learning accuracy is added as a measure

C-PANKOW
• Result:
