Extracting Symbolic Knowledge From The Web Ofer Neiman.

Extracting Symbolic Knowledge From The Web

Ofer Neiman

The Problem

• The WWW contains information that is easy for humans to understand but much harder for computers to use:

• Web pages contain mostly text, images and sounds. Humans can process these immediately, but the information is not necessarily arranged in a way that suits automatic problem solving by computers.

The Web->KB Project’s Long Range Goal

• Create a computer understandable knowledge base whose content mirrors that of the WWW.

• This Knowledge Base would consist of assertions in symbolic form.

Web->KB Goal (ctd.)

• At the minimum, the KB would allow more sophisticated queries than keyword-based search engines.

The 1st Step : A Machine Learning Approach

• Develop a TRAINABLE system that can be taught to extract information.

• The inputs to the system:

1) An ontology specifying classes in hierarchical tree form and relations between these classes.

Class examples: Student, Person, Research.Project

Relation examples: Advisor.Of(Instructor,Student), Projects.Led.By(Project,Researcher).
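As a rough illustration of the first input, the ontology can be held as a parent map for the class tree plus a table of relation signatures. This is only a sketch: the `Ontology` type and its methods are hypothetical, and placing Instructor and Researcher under Person (and using Research.Project as the project class) is an assumption made for the example, not something the slides specify.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    # child class -> parent class (None for a root of the tree)
    parents: dict = field(default_factory=dict)
    # relation name -> (class of first argument, class of second argument)
    relations: dict = field(default_factory=dict)

    def add_class(self, name, parent=None):
        self.parents[name] = parent

    def add_relation(self, name, first, second):
        self.relations[name] = (first, second)

    def ancestors(self, name):
        """Return the ancestor classes of `name`, nearest first."""
        chain = []
        current = self.parents.get(name)
        while current is not None:
            chain.append(current)
            current = self.parents.get(current)
        return chain


onto = Ontology()
onto.add_class("Person")
onto.add_class("Student", parent="Person")
onto.add_class("Instructor", parent="Person")   # assumed placement
onto.add_class("Researcher", parent="Person")   # assumed placement
onto.add_class("Research.Project")
onto.add_relation("Advisor.Of", "Instructor", "Student")
onto.add_relation("Projects.Led.By", "Research.Project", "Researcher")
```

The `ancestors` lookup is what would let a system fall back to a more general class when it cannot decide between siblings.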

A Machine Learning Approach (ctd.)

• 2) Training examples that represent instances of the classes and relations.

• Given the ontology and training examples, the system is expected to extract NEW instances of classes and relations from the web.

Some Simplifying Assumptions

• 1) Instances of ontology classes are represented by single web pages (‘home pages’).

• 2) Instances of relations R(A,B) are represented in one of 3 ways:

- a segment of text connecting the segment representing A to the segment representing B.

- a contiguous segment of text representing A that contains a segment representing B.

Assumptions (ctd.)

• Example: Jim’s home page contains the words ‘Intro To AI’, so courses.taught.by(Jim,Intro2AI) holds.

- Third way: the segment representing A is related to B because it fits some pattern.

• Example: Jim’s page contains words typical of AI researchers, so the relation Research.Of(Jim,AI) holds.

The Goal In More Specific Terms

• 1. Recognizing class instances by classifying segments of hypertext

• 2. Recognizing relation instances, mostly by classifying hyperlinks

A KB for CS Departments

• 2 sets of pages were drawn from CS departments, each with more than 4000 pages.

• The 1st set was drawn from four departments, and the 2nd from many departments.

• The classes and relation instances were hand labeled.

• The idea is to learn instances from one department using the data from other departments as training input.

1st Method For Recognizing Class Instances - A Statistical Model

• From the labeled data, each class A is assigned a vector of probabilities W(A):

$W(A) = (W_1, \ldots, W_{|\mathrm{vocabulary}|})$

$W_i$ = the frequency of word i in pages representing class A.

• The working vocabulary size is finite (2000 key words)

Recognizing Class Instances - A Statistical Model (ctd.)

• A new page P is assigned to the class that is most probable given the distribution of words in P (the class at minimal distance).

• The calculation is done using a variant of the KL divergence, which measures the distance between distributions:

$$D(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
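The classification step can be sketched as follows. This is a minimal sketch rather than the project's actual classifier: it uses plain KL divergence where the slides only say "a variant", and the add-one smoothing, the four-word vocabulary and the toy training pages are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_distribution(tokens, vocabulary, smoothing=1.0):
    """Smoothed relative frequency of each vocabulary word in `tokens`."""
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values()) + smoothing * len(vocabulary)
    return {w: (counts[w] + smoothing) / total for w in vocabulary}


def kl_divergence(p, q):
    """D(P||Q) = sum over x of P(x) * log(P(x) / Q(x))."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)


def classify(page_tokens, class_models, vocabulary):
    """Return (class, divergence) for the class model nearest to the page."""
    p = word_distribution(page_tokens, vocabulary)
    scores = {c: kl_divergence(p, w) for c, w in class_models.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Toy data (hypothetical): one "training page" per class.
vocabulary = ["homework", "syllabus", "advisor", "thesis"]
training = {
    "Course": "syllabus homework homework syllabus".split(),
    "Student": "advisor thesis thesis advisor".split(),
}
class_models = {c: word_distribution(toks, vocabulary)
                for c, toks in training.items()}

label, distance = classify("homework syllabus thesis".split(),
                           class_models, vocabulary)
```

Smoothing keeps every Q(x) strictly positive, so the logarithm is always defined even for words a class never used in training.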

A Statistical Model (ctd.)

• Evaluating the results, according to two criteria:

1) Coverage - the percentage of pages from a given class that are correctly classified as belonging to that class.

2) Accuracy - the percentage of pages classified to a class that are actually members of that class.

• There’s a natural trade-off between the two.

• We can trade coverage against accuracy by setting a confidence threshold: page P will be classified to class A only if the distribution in P is sufficiently similar to the distribution in A. A stricter threshold raises accuracy and lowers coverage; a looser one does the opposite.
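The two criteria and the effect of the threshold can be illustrated with a small calculation. The function name, the confidence scores (higher meaning "more similar") and the five labeled pages below are hypothetical; only the definitions of coverage and accuracy come from the slides.

```python
def coverage_and_accuracy(predictions, true_labels, target, threshold):
    """Compute (coverage, accuracy) for class `target`.

    predictions: one (predicted_class, confidence) pair per page.
    A page is assigned to `target` only if confidence >= threshold.
    """
    accepted = [
        truth
        for (pred, conf), truth in zip(predictions, true_labels)
        if pred == target and conf >= threshold
    ]
    correct = sum(1 for truth in accepted if truth == target)
    total_in_class = sum(1 for truth in true_labels if truth == target)
    coverage = correct / total_in_class if total_in_class else 0.0
    accuracy = correct / len(accepted) if accepted else 0.0
    return coverage, accuracy


predictions = [("Student", 0.9), ("Student", 0.8), ("Student", 0.4),
               ("Course", 0.7), ("Student", 0.3)]
true_labels = ["Student", "Student", "Course", "Student", "Student"]

strict = coverage_and_accuracy(predictions, true_labels, "Student", 0.75)
loose = coverage_and_accuracy(predictions, true_labels, "Student", 0.25)
```

On this toy data the strict threshold gives coverage 0.5 with accuracy 1.0, while the loose one gives 0.75 for both, which is the trade-off described above.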

A Statistical Model (ctd.)

• Mistaken Classifications: Since the ontology is hierarchical (Person -> Faculty Or Staff Or Student), a classification of an instance into a more general ancestor class can still be useful.

Some Experimental Results

The Student Class:

Coverage = 20%, Accuracy = 67%

Coverage = 80%, Accuracy = 45%

The Course Class:

Coverage = 20%, Accuracy = 55%

Coverage = 80%, Accuracy = 30%

2nd Method For Recognizing Class Instances: Learning First Order Rules

• Instead of looking just at the word pattern on a given page, we look at the word pattern in neighboring pages as well.

• Example: A page is a course home page if it contains the words ‘textbook’ and ‘TA’, and is linked to a page that contains the word ‘assignment’.

Recognizing Class Instances: Learning First Order Rules (ctd.)

• We need an algorithm to infer rules in predicate form, where the arguments are pages.

• The rule defines a target class, using basic (atomic) predicates of 2 kinds:

• has_word(Page) - a finite family of predicates, one for each word (e.g. has_professor), indicating that the page contains that word.

• link_to(Page1,Page2) - there is a hyperlink from page 1 to page 2.

Recognizing Class Instances : Learning First Order Rules (ctd.)

• The input: positive instances of basic predicates, and positive and negative instances of the target class.

• The output rules are in terms of the basic predicates (previous slide).

• The algorithm (FOIL) uses a greedy method: at each stage, it adds to the current definition the basic predicate that excludes as many as possible of the negative examples still covered, while keeping as many positive examples as possible.
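The greedy step can be sketched in a heavily simplified form. The sketch below is not full FOIL: it learns a single clause, considers only has_word literals (no variables and no link_to), and scores candidates with the usual FOIL information gain. The function names and the toy pages are hypothetical.

```python
import math


def foil_gain(p0, n0, p1, n1):
    """Information gain of a literal that keeps p1 of p0 positive
    and n1 of n0 negative examples."""
    if p1 == 0:
        return 0.0
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)


def learn_rule(positives, negatives, vocabulary):
    """Greedily add has_word literals until no negative page is covered.

    positives / negatives: lists of word sets, one set per page.
    Returns the list of words whose has_word literals form the rule body.
    """
    rule, pos, neg = [], list(positives), list(negatives)
    while neg:
        best_word, best_gain = None, 0.0
        for word in vocabulary:
            if word in rule:
                continue
            p1 = sum(1 for page in pos if word in page)
            n1 = sum(1 for page in neg if word in page)
            gain = foil_gain(len(pos), len(neg), p1, n1)
            if gain > best_gain:
                best_word, best_gain = word, gain
        if best_word is None:  # no literal improves the rule; stop
            break
        rule.append(best_word)
        # Keep only the examples still covered by the extended rule.
        pos = [page for page in pos if best_word in page]
        neg = [page for page in neg if best_word in page]
    return rule


positives = [{"professor", "ph", "research"}, {"professor", "ph", "teaching"}]
negatives = [{"professor", "course"}, {"ph", "student"}, {"homework"}]
vocabulary = ["professor", "ph", "research", "course",
              "student", "homework", "teaching"]

rule = learn_rule(positives, negatives, vocabulary)
```

On this toy data the loop first picks 'professor', then 'ph', at which point no negative page is still covered and the rule is complete.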

Recognizing Class Instances : Learning First Order Rules (ctd.)

• Example: The algorithm found the following definition for the class faculty:

faculty(A) := has_professor(A), has_ph(A), link_to(B,A), has_faculty(B)

• Meaning: A page belongs to a faculty member if it contains the words ‘professor’ and ‘ph’ (prefix of ‘phd’), and there is a link to it from a page containing the word ‘faculty’.
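To make the rule concrete, here is a small sketch that checks it against a toy web graph. The page names, word sets and links are hypothetical, and `is_faculty` is just an illustrative helper, not part of the Web->KB system.

```python
def is_faculty(page, words, links):
    """faculty(A) := has_professor(A), has_ph(A), link_to(B,A), has_faculty(B)."""
    if not {"professor", "ph"} <= words[page]:
        return False
    # Some page B must link to A and contain the word 'faculty'.
    return any(target == page and "faculty" in words[source]
               for source, target in links)


# Toy data: words found on each page, and (source, target) hyperlinks.
words = {
    "jim.html": {"professor", "ph", "ai"},
    "dept.html": {"faculty", "directory"},
    "ann.html": {"professor", "ph"},
    "blog.html": {"news"},
}
links = [("dept.html", "jim.html"), ("blog.html", "ann.html")]

faculty_pages = [p for p in words if is_faculty(p, words, links)]
```

Here only jim.html qualifies: ann.html has the right words but is linked from a page without the word 'faculty', which shows how the neighboring page contributes evidence.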

More Experimental Results (for first order rules)

The Student class:

Coverage = 20%, Accuracy = 90%

Coverage = 80%, Accuracy = 70%

First order rules tend to be more accurate than statistical classification, but the coverage is not as good (it is hard to come up with rules under which all instances of the class are classified as belonging to it).

Recognizing Relation Instances

- The main issue to consider is hyperlink connection.

- The algorithm is similar to the algorithm for learning 1st order class rules.

- The underlying assumption: a relation between pages is expressed in terms of a hyperlink or a chain of hyperlinks. Therefore, we need predicates whose arguments are pages and hyperlinks.

- The algorithm is applied assuming class instances have already been extracted.


Recognizing Relation Instances (ctd.)

The basic predicates:

class(Page)

link_to(Hyperlink,Page,Page)

has_word(Hyperlink)

all_words_capitalized(Hyperlink)

has_alphanumeric_word(Hyperlink)

has_neighborhood_word(Hyperlink)




An Example Learned Rule:

members.of.project(A,B) := research_project(A), person(B), link_to(C,A,D), link_to(E,D,B)

Meaning: The project’s home page A points to an intermediate page D, which points to a personal home page B.
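The rule above can be read as a two-hop path query over hyperlinks. The sketch below evaluates it on toy data, assuming each link_to fact is stored as a (hyperlink, source page, target page) triple and that page classes were assigned beforehand; the page names and the `members_of_project` helper are hypothetical.

```python
def members_of_project(classes, hyperlinks):
    """Return all (A, B) pairs satisfying
    research_project(A), person(B), link_to(C,A,D), link_to(E,D,B)."""
    results = set()
    for _c, a, d in hyperlinks:
        if classes.get(a) != "research_project":
            continue
        # Follow a second hyperlink E from the intermediate page D.
        for _e, source, b in hyperlinks:
            if source == d and classes.get(b) == "person":
                results.add((a, b))
    return results


# Toy data: classes already extracted for some pages.
classes = {
    "webkb.html": "research_project",
    "jim.html": "person",
    "ann.html": "person",
}
hyperlinks = [
    ("h1", "webkb.html", "people.html"),
    ("h2", "people.html", "jim.html"),
    ("h3", "people.html", "ann.html"),
    ("h4", "jim.html", "ann.html"),
]

members = members_of_project(classes, hyperlinks)
```

The intermediate page people.html needs no class of its own; it only has to sit on the path between the project page and the personal pages.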




Another Rule:

department.of.person(A,B) := person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), has_neighborhood_graduate(E)

Meaning: A 3-hyperlink path from a department to a person, requiring that the word ‘graduate’ occur near the 2nd hyperlink.




Better results than for learning 1st order class rules: the coverage was not perfect, but the accuracy was close to 100%.



Future Research

1) Relaxation of restrictions, e.g. a class instance may be represented by more than one page.

2) Exploiting HTML structure: there are different kinds of text fields in a page.

References

1) M. Craven et al., Learning to Extract Symbolic Knowledge from the World Wide Web, AAAI-98.

2) More articles: The Machine Learning Group at CMU: www.cs.cmu.edu/Groups/~TextLearning/
