Extracting Symbolic Knowledge From The Web Ofer Neiman.

  • date post

    19-Dec-2015

Transcript of Extracting Symbolic Knowledge From The Web Ofer Neiman.

Page 1: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Extracting Symbolic Knowledge From The Web

Ofer Neiman

Page 2: Extracting Symbolic Knowledge From The Web Ofer Neiman.

The Problem

• The WWW contains information that is easy for humans to understand but much harder for computers to use:

• Web pages contain mostly text, images and sounds. Humans can process these immediately, but the information is not necessarily arranged in a form suited to automatic problem solving by computers.

Page 3: Extracting Symbolic Knowledge From The Web Ofer Neiman.

The Web->KB Project’s Long Range Goal

• Create a computer-understandable knowledge base whose content mirrors that of the WWW.

• This Knowledge Base would consist of assertions in symbolic form.

Page 4: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Web->KB Goal (ctd.)

• At the minimum, the KB would allow more sophisticated queries than keyword-based search engines.

Page 5: Extracting Symbolic Knowledge From The Web Ofer Neiman.

The 1st Step: A Machine Learning Approach

• Develop a TRAINABLE system that can be taught to extract information.

• The inputs to the system:

1) An ontology specifying classes in hierarchical tree form and relations between these classes.

Class examples: Student, Person, Research.Project

Relation examples: Advisor.Of(Instructor,Student), Projects.Led.By(Project,Researcher)
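One way to picture the ontology input is as a small data structure. The sketch below is only an illustration: the `Ontology` class, its method names, and the choice to place Student under Person are assumptions, while the class and relation names come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # child class -> parent class (None marks a root of the tree)
    parents: dict = field(default_factory=dict)
    # relation name -> tuple of argument class names
    relations: dict = field(default_factory=dict)

    def add_class(self, name, parent=None):
        self.parents[name] = parent

    def add_relation(self, name, *arg_classes):
        self.relations[name] = arg_classes

    def ancestors(self, name):
        """Return the chain of classes from `name` up to its root."""
        chain = []
        while name is not None:
            chain.append(name)
            name = self.parents[name]
        return chain


onto = Ontology()
onto.add_class("Person")
onto.add_class("Student", parent="Person")  # assumed placement in the tree
onto.add_class("Research.Project")
onto.add_relation("Advisor.Of", "Instructor", "Student")
onto.add_relation("Projects.Led.By", "Project", "Researcher")

print(onto.ancestors("Student"))
```

Keeping the parent links explicit makes it cheap to ask whether a class is a descendant of another.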

Page 6: Extracting Symbolic Knowledge From The Web Ofer Neiman.

A Machine Learning Approach (ctd.)

• 2) (2nd input): Training examples that represent instances of the classes and relations.

• Given the ontology and training examples, the system is expected to extract NEW instances of classes and relations from the web.

Page 7: Extracting Symbolic Knowledge From The Web Ofer Neiman.
Page 8: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Some Simplifying Assumptions

• 1) Instances of ontology classes are represented by single web pages (‘home pages’).

• 2) Instances of relations R(A,B) are represented in one of 3 ways:

- a segment of text connecting the segment representing A to the segment representing B.

- a contiguous segment of text representing A that contains a segment representing B.

Page 9: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Assumptions (ctd.)

• Example: Jim’s home page contains the words ‘Intro To AI’, so courses.taught.by(Jim,Intro2AI) holds.

- The 3rd way: the segment representing A is related to B because it fits some pattern.

• Example: Jim’s page contains words typical of AI researchers, so the relation Research.Of(Jim,AI) holds.

Page 10: Extracting Symbolic Knowledge From The Web Ofer Neiman.

The Goal In More Specific Terms

• 1. Recognizing class instances by classifying segments of hypertext

• 2. Recognizing relation instances, mostly by classifying hyperlinks

Page 11: Extracting Symbolic Knowledge From The Web Ofer Neiman.

A KB for CS Departments

• 2 sets of pages were drawn from CS departments, each with more than 4000 pages.

• The 1st set was drawn from four departments, and the 2nd from many departments.

• The classes and relation instances were hand labeled.

• The idea is to learn instances from one department using the data from other departments as training input.

Page 12: Extracting Symbolic Knowledge From The Web Ofer Neiman.

1st Method For Recognizing Class Instances - A Statistical Model

• From the labeled data, each class A is assigned a vector of probabilities W(A):

$W(A) = (W_1, \ldots, W_{|\text{vocabulary}|})$

where $W_i$ is the frequency of word i in pages representing class A.

• The working vocabulary is limited to 2000 key words.

Page 13: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Recognizing Class Instances - A Statistical Model (ctd.)

• A new page P is assigned to the class that is most probable given the distribution of words in P (the class at minimal distance).

• The calculation is done using a variant of the KL divergence, which measures the distance between distributions:

$D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
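The classification step can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: it uses the plain KL divergence rather than the variant the slide mentions, adds add-one smoothing so that no probability is zero, and runs on a tiny made-up vocabulary and made-up training words.

```python
import math
from collections import Counter


def word_distribution(words, vocabulary, smoothing=1.0):
    """Smoothed word frequencies over a fixed vocabulary (smoothing is an assumption)."""
    counts = Counter(w for w in words if w in vocabulary)
    total = sum(counts.values()) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def kl_divergence(p, q):
    """D(P||Q) = sum over x of P(x) * log(P(x) / Q(x))."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


def classify(page_words, class_vectors, vocabulary):
    """Assign the page to the class whose word vector is closest in KL divergence."""
    p = word_distribution(page_words, vocabulary)
    return min(class_vectors, key=lambda c: kl_divergence(p, class_vectors[c]))


# Hypothetical vocabulary and labeled training words, for illustration only.
vocabulary = ["advisor", "homework", "syllabus", "thesis"]
training = {
    "student": ["advisor", "thesis", "thesis", "homework"],
    "course": ["syllabus", "homework", "homework", "syllabus"],
}
class_vectors = {c: word_distribution(ws, vocabulary) for c, ws in training.items()}

print(classify(["syllabus", "homework", "homework"], class_vectors, vocabulary))
```

Each `class_vectors` entry plays the role of W(A) for one class.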

Page 14: Extracting Symbolic Knowledge From The Web Ofer Neiman.

A Statistical Model (ctd.)

• The results are evaluated according to two criteria:

1) Coverage - The percentage of pages from a given class that are correctly classified as belonging to that class.

2) Accuracy - The percentage of pages classified to a class that are actually members of that class.

• There’s a natural trade-off between the two.

• We can raise or lower coverage (accuracy) by setting a confidence threshold: page P will be classified to class A only if the distribution in P is sufficiently similar to the distribution in A.
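The trade-off can be seen on a toy example. In the sketch below the function name, the (true class, predicted class, confidence) triples and the confidence values are all hypothetical; only the definitions of coverage and accuracy follow the slide.

```python
def coverage_and_accuracy(scored, target, threshold):
    """scored: list of (true_class, predicted_class, confidence) triples."""
    # Keep only predictions of the target class that clear the threshold.
    accepted = [t for t, p, c in scored if p == target and c >= threshold]
    correct = sum(1 for t in accepted if t == target)
    members = sum(1 for t, _, _ in scored if t == target)
    coverage = correct / members if members else 0.0
    accuracy = correct / len(accepted) if accepted else 0.0
    return coverage, accuracy


# Made-up classifier output for six pages.
scored = [
    ("student", "student", 0.9),
    ("student", "student", 0.6),
    ("course", "student", 0.5),
    ("student", "course", 0.8),
    ("student", "student", 0.3),
    ("course", "student", 0.2),
]

for threshold in (0.7, 0.25):
    print(threshold, coverage_and_accuracy(scored, "student", threshold))
```

On this data a strict threshold accepts few pages, all of them correct, while a loose threshold covers more student pages at the cost of admitting a course page.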

Page 15: Extracting Symbolic Knowledge From The Web Ofer Neiman.

A Statistical Model (ctd.)

• Mistaken classifications: since the ontology is hierarchical (Person -> Faculty, Staff or Student), classifying an instance into a more general ancestor class can still be useful.

Page 16: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Some Experimental Results

The Student Class:

Coverage = 20%, Accuracy = 67%

Coverage = 80%, Accuracy = 45%

The Course Class:

Coverage = 20%, Accuracy = 55%

Coverage = 80%, Accuracy = 30%

Page 17: Extracting Symbolic Knowledge From The Web Ofer Neiman.

2nd Method For Recognizing Class Instances: Learning First Order Rules

• Instead of looking just at the word pattern on a given page, we look at the word pattern in neighboring pages as well.

• Example: A page is a course home page if it contains the words ‘textbook’ and ‘TA’, and is linked to a page that contains the word ‘assignment’.

Page 18: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Recognizing Class Instances: Learning First Order Rules (ctd.)

• We need an algorithm to infer rules in predicate form, where the arguments are pages.

• The rule defines a target class, using basic (atomic) predicates of 2 kinds:

• has_word(Page) - a finite family of predicates, one for each word in the vocabulary (for example has_textbook); each indicates that the page contains that word.

• link_to(Page1,Page2) - there is a hyperlink from page 1 to page 2.

Page 19: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Recognizing Class Instances : Learning First Order Rules (ctd.)

• The input: positive instances of basic predicates, and positive and negative instances of the target class.

• The output rules are in terms of the basic predicates (previous slide).

• The algorithm (FOIL) uses a greedy method: at each stage, it adds to the current definition the basic predicate that excludes as many of the remaining negative examples as possible while still covering as many positive examples as possible.
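The greedy step can be illustrated with a heavily simplified sketch. It is not FOIL itself: it handles only word predicates on a single page (no variables, no link_to), and it scores a candidate by "positives kept minus negatives kept" instead of FOIL's information-gain measure. The function name and the example pages are assumptions.

```python
def learn_rule(positives, negatives, literals):
    """Greedily build one conjunctive rule.

    literals: dict mapping a predicate name to a test function on an example.
    Returns the list of chosen predicate names.
    """
    rule = []
    pos, neg = list(positives), list(negatives)
    while neg:  # stop once no negative example satisfies the rule
        best_name, best_score = None, None
        for name, test in literals.items():
            if name in rule:
                continue
            kept_pos = sum(1 for e in pos if test(e))
            kept_neg = sum(1 for e in neg if test(e))
            if kept_pos == 0 or kept_neg == len(neg):
                continue  # drops every positive, or excludes no negative
            score = kept_pos - kept_neg  # simplified stand-in for FOIL's gain
            if best_score is None or score > best_score:
                best_name, best_score = name, score
        if best_name is None:
            break  # no remaining literal helps
        rule.append(best_name)
        pos = [e for e in pos if literals[best_name](e)]
        neg = [e for e in neg if literals[best_name](e)]
    return rule


# Hypothetical pages, each reduced to its set of words.
course_pages = [{"textbook", "ta", "exam"}, {"textbook", "ta", "syllabus"}]
other_pages = [{"textbook", "publications"}, {"ta", "resume"}, {"resume", "publications"}]
literals = {f"has_{w}": (lambda e, w=w: w in e) for w in ["textbook", "ta", "resume"]}

print(learn_rule(course_pages, other_pages, literals))
```

Each pass through the loop adds one predicate and discards the negative examples it rules out, which mirrors the stage-by-stage description above.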

Page 20: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Recognizing Class Instances : Learning First Order Rules (ctd.)

• Example: the algorithm found the following definition for the class faculty:

faculty(A) := has_professor(A), has_ph(A), link_to(B,A), has_faculty(B)

• Meaning: a page belongs to a faculty member if it contains the words ‘professor’ and ‘ph’ (prefix of ‘phd’), and there is a link to it from a page containing the word ‘faculty’.
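To make the rule concrete, here is a small sketch that evaluates it over a toy page graph. The page names, their word sets and their links are invented for illustration, and ‘ph’ is assumed to appear as a token of its own after preprocessing.

```python
# Hypothetical pages: the words each contains and the pages each links to.
pages = {
    "directory": {"words": {"faculty", "directory"}, "links": {"jim", "ann"}},
    "jim": {"words": {"professor", "ph", "research"}, "links": set()},
    "ann": {"words": {"student", "ph"}, "links": set()},
    "bob": {"words": {"professor", "ph"}, "links": set()},  # nothing links here
}


def has_word(page, word):
    return word in pages[page]["words"]


def link_to(src, dst):
    return dst in pages[src]["links"]


def faculty(a):
    """faculty(A) := has_professor(A), has_ph(A), link_to(B,A), has_faculty(B)."""
    return (
        has_word(a, "professor")
        and has_word(a, "ph")
        and any(link_to(b, a) and has_word(b, "faculty") for b in pages)
    )


print([name for name in pages if faculty(name)])
```

The variable B in the rule is existential, which is why the check uses `any` over all pages.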

Page 21: Extracting Symbolic Knowledge From The Web Ofer Neiman.

More Experimental Results (for first order rules)

The student class:

Coverage = 20%, Accuracy = 90%

Coverage = 80%, Accuracy = 70%

First order rules tend to be more accurate than statistical classification, but the coverage is not as good: it is hard to come up with rules that classify all instances of the class as belonging to the class.

Page 22: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Recognizing Relation Instances

- The main issue to consider is hyperlink connection.

- The algorithm is similar to the algorithm for learning 1st order class rules.

- The underlying assumption: a relation between pages is expressed in terms of a hyperlink or a chain of hyperlinks. Therefore, we need predicates whose arguments are pages and hyperlinks.

- The algorithm is applied assuming class instances have already been extracted.


Page 23: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Recognizing Relation Instances (ctd.)

The basic predicates:

class(Page)

link_to(Hyperlink,Page,Page)

has_word(Hyperlink)

all_words_capitalized(Hyperlink)

has_alphanumeric_word(Hyperlink)

has_neighborhood_word(Hyperlink)


Page 24: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Recognizing Relation Instances (ctd.)

An Example Learned Rule:

members.of.project(A,B) := research_project(A), person(B), link_to(C,A,D), link_to(E,D,B)

Meaning: the project’s home page A points to an intermediate page D, which points to a personal home page B.
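A sketch of how such a rule could be checked against extracted data. It assumes class instances have already been labeled, and it stores each hyperlink as a (hyperlink id, source page, target page) triple to match link_to(Hyperlink,Page,Page); all page names and link ids are hypothetical.

```python
# Hypothetical class labels produced by an earlier classification step.
classes = {"vision_lab": "research_project", "alice": "person", "carl": "person"}

# Hyperlinks as (hyperlink_id, source_page, target_page).
links = [
    ("h1", "vision_lab", "people"),
    ("h2", "people", "alice"),
    ("h3", "vision_lab", "papers"),
    ("h4", "dept_home", "carl"),
]


def members_of_project(a, b):
    """research_project(A), person(B), link_to(C,A,D), link_to(E,D,B)."""
    if classes.get(a) != "research_project" or classes.get(b) != "person":
        return False
    # Look for an intermediate page D with A -> D and D -> B.
    return any(
        src1 == a and dst2 == b and dst1 == src2
        for _, src1, dst1 in links
        for _, src2, dst2 in links
    )


print(members_of_project("vision_lab", "alice"))
```

The hyperlink ids C and E are unused here because this rule places no conditions on the links themselves, only on the pages they connect.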


Page 25: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Recognizing Relation Instances (ctd.)

Another Rule:

department.of.person(A,B) := person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), has_neighborhood_graduate(E)

Meaning: a 3 hyperlink path from a department to a person, requiring that the word ‘graduate’ occur near the 2nd hyperlink.
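This rule adds a condition on a hyperlink itself, which the following sketch tries to show. Each hyperlink carries a hypothetical `near` set standing in for the words in its neighborhood; the page names, link ids and the way a neighborhood is stored are assumptions.

```python
# Hypothetical class labels from an earlier classification step.
classes = {"cs_dept": "department", "bob": "person", "carol": "person"}

# Hyperlink id -> source page, target page, and words found near the link.
links = {
    "g": {"src": "cs_dept", "dst": "people", "near": {"people", "directory"}},
    "e": {"src": "people", "dst": "grads", "near": {"graduate", "students"}},
    "c": {"src": "grads", "dst": "bob", "near": {"bob"}},
    "x": {"src": "people", "dst": "staff", "near": {"staff"}},
    "y": {"src": "staff", "dst": "carol", "near": {"carol"}},
}


def department_of_person(a, b):
    """person(A), department(B), link_to(C,D,A), link_to(E,F,D),
    link_to(G,B,F), has_neighborhood_graduate(E)."""
    if classes.get(a) != "person" or classes.get(b) != "department":
        return False
    for g in links.values():  # 1st link: B -> F
        if g["src"] != b:
            continue
        for e in links.values():  # 2nd link: F -> D, with 'graduate' nearby
            if e["src"] != g["dst"] or "graduate" not in e["near"]:
                continue
            for c in links.values():  # 3rd link: D -> A
                if c["src"] == e["dst"] and c["dst"] == a:
                    return True
    return False


print(department_of_person("bob", "cs_dept"))
```

Carol is also three links away from the department, but her path fails the neighborhood test on the 2nd hyperlink, so the rule does not fire for her.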


Page 26: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Recognizing Relation Instances (ctd.)

Better results than for learning 1st order class rules: the coverage was not perfect, but the accuracy was close to 100%.


Page 27: Extracting Symbolic Knowledge From The Web Ofer Neiman.

Future Research

1) Relaxation of restrictions, e.g. a class instance may be represented by more than one page.

2) Exploiting HTML structure: there are different kinds of text fields in a page.

Page 28: Extracting Symbolic Knowledge From The Web Ofer Neiman.

References

1) M. Craven et al., Learning to Extract Symbolic Knowledge from the World Wide Web, AAAI-98.

2) More articles: The Machine Learning Group at CMU: www.cs.cmu.edu/Groups/~TextLearning/