prie.ppt

Machine Learning for the Semantic Web, Feb 14th 2005

Information extraction from HTML product catalogues

Martin Labský¹, Vojtěch Svátek¹, Pavel Praks², Ondřej Šváb¹

{labsky, svatek, xsvao06}@vse.cz, [email protected]

rainbow.vse.cz

1 Dept. of Information and Knowledge Engineering, Prague University of Economics

2 Dept. of Applied Mathematics, Technical University of Ostrava

Coupling quantitative and knowledge-based approaches


Agenda

• Overview of the Rainbow project
• Extraction of product offers
  – Annotation using HMMs
  – Impact of image information
  – Ontology-based instance extraction
  – Search interface
• Future work


Rainbow overview

• Goal
  – to present the content and structure of legacy websites to a user or computer agent
• How
  – multiway analysis of websites: utilize features derived from text, images, formatting, URLs, navigation structure and background knowledge
• Modular architecture, web services
  – information extraction (HMMs)
  – discovery of website navigation structure (link graph)
  – image classifiers (histograms, dimensions, similarity)
  – URL classifier (rule-based)
  – extractor of summarizing sentences (bootstrapped indicator keywords)


Application of Rainbow


Extraction of product offers

• Combines
  – automatic document annotation using HMMs
  – image classifier
  – ontology-based instance composition
  – URL classifier for focused crawling
  – structured search interface powered by Sesame
• The data
  – over 1000 bicycle offers (labeled using 15 attributes)
  – in 100 pages from different websites


Sample data


Preprocessing

• HTML cleanup
  – conversion to valid XHTML
• Only potentially relevant blocks kept
  – blocks that do not directly contain text or images omitted
• Formatting tags
  – attributes removed
  – several rules matching common constructions (add-to-basket form, choose-amount button)
• Images
  – baseline: all images treated as a single token
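The token-level effect of these steps can be sketched as follows. This is a minimal illustration, not the Rainbow implementation: the class name `CatalogueTokenizer`, the function `tokenize`, the token spellings (`<td>`, `IMG`) and the sample markup are all assumptions made for the example, and block filtering and the construction-matching rules are left out.

```python
from html.parser import HTMLParser


class CatalogueTokenizer(HTMLParser):
    """Turn cleaned-up HTML into a flat token stream (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Baseline: every image becomes the same single token.
            self.tokens.append("IMG")
        else:
            # Formatting tags are kept, but their attributes are dropped.
            self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        # Self-closing <img/> also triggers an end-tag event; ignore it.
        if tag != "img":
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is split into whitespace-delimited word tokens.
        self.tokens.extend(data.split())


def tokenize(html):
    parser = CatalogueTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Invented sample markup, for illustration only.
sample = '<td class="p"><img src="bike.jpg"/><b>Trek 4300</b> 399 EUR</td>'
print(tokenize(sample))
```

The resulting sequence mixes tag tokens, word tokens and image tokens, which is the kind of input a token-level annotator can consume.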


Annotation using HMMs

• HMM structure
  – target, prefix, suffix and background states
  – adopted from [Freitag, McCallum 99]
• Single tag trigram model for all tags
• F-measures
  – 83% for name, 89% for price
  – 56% average for 13 other attributes (17-90%)
• Variations
  – word-ngram models for lexical probabilities of target states
  – state substructures instead of single target states, learned by EM
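How the four state types cooperate during annotation can be illustrated with a small Viterbi decoder. This is a simplified sketch: it uses a first-order transition model rather than the trigram model mentioned above, handles a single attribute (price), and every probability, token and function name below is an invented toy value, not a parameter of the actual system.

```python
import math

STATES = ["background", "prefix", "target", "suffix"]

# Toy parameters (assumptions for illustration only).
START = {"background": 1.0}
TRANS = {
    "background": {"background": 0.7, "prefix": 0.3},
    "prefix": {"target": 1.0},
    "target": {"target": 0.4, "suffix": 0.6},
    "suffix": {"background": 1.0},
}
EMIT = {
    "background": {"new": 0.4, "bike": 0.4},
    "prefix": {"price:": 0.9},
    "target": {"399": 0.5, "450": 0.5},
    "suffix": {"EUR": 0.9},
}


def log(p):
    """Logarithm that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(tokens, floor=1e-6):
    """Return the most likely state sequence for the token sequence."""
    # best[s] = (log-probability, path) of the best path ending in state s
    best = {
        s: (log(START.get(s, 0.0)) + log(EMIT[s].get(tokens[0], floor)), [s])
        for s in STATES
    }
    for tok in tokens[1:]:
        new_best = {}
        for s in STATES:
            # Pick the best predecessor state r for state s.
            score, path = max(
                ((best[r][0] + log(TRANS[r].get(s, 0.0)), best[r][1]) for r in STATES),
                key=lambda item: item[0],
            )
            # Unseen tokens get a small floor probability instead of zero.
            new_best[s] = (score + log(EMIT[s].get(tok, floor)), path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


tokens = ["new", "bike", "price:", "399", "EUR", "new"]
labels = viterbi(tokens)
for tok, state in zip(tokens, labels):
    print(f"{tok:8s} {state}")
```

Tokens decoded into the target state are the ones that would receive the attribute label; prefix and suffix states model the typical context immediately around a target.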


Impact of image information

• Image classifier
  – classifies into 3 classes: Pos, Neg, Unk
  – before HMM annotation, each image occurrence in a document is substituted by its class
  – best result: 6.6% error rate for binary classification with a multi-layer perceptron (Weka)
• Features used for classification
  – dimensions (estimated 2-dimensional normal distribution)
  – similarity (latent semantic similarity [Praks 2004])
  – whether the same image repeats in the same document
• Results
  – image precision increased by 19.1%, recall by 2%
  – improvements for other tags negligible
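The dimension feature can be illustrated by fitting a 2-dimensional normal distribution to the (width, height) pairs of known product images and scoring new images by their log-density. The sketch below covers this one feature only, not the full classifier; the function names and the sample sizes are assumptions made up for the example.

```python
import math


def fit_gaussian_2d(points):
    """Maximum-likelihood mean and covariance of (width, height) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return mx, my, sxx, syy, sxy


def log_density(params, x, y):
    """Log-density of a 2-D normal distribution at the point (x, y)."""
    mx, my, sxx, syy, sxy = params
    det = sxx * syy - sxy ** 2  # determinant of the covariance matrix
    dx, dy = x - mx, y - my
    # Quadratic form d^T * inverse(covariance) * d, written out for the 2x2 case.
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad


# Invented sizes of positive (product) images, in pixels.
product_sizes = [(200, 150), (220, 160), (180, 140), (210, 170), (190, 130)]
params = fit_gaussian_2d(product_sizes)

typical = log_density(params, 200, 150)  # size close to the training mean
icon = log_density(params, 16, 16)       # small icon-like image
print(typical > icon)
```

A score of this kind would be only one input to the classifier, alongside the similarity and repetition features listed above.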


Ontology-based instance extraction

[Diagram components: Document annotated by HMM; Presentation ontology; Instance extraction algorithm; Instances (XML); Sesame RDF repository]


Domain ontology Presentation ontology


Instance extraction algorithm

• Sequentially parses the annotated document
• Adds annotated attributes to the working instance WI
• If adding an attribute would cause an inconsistency, a new empty WI is created; the old WI is saved only if it is consistent

WI = empty_instance;
while (more_attributes) {
    A = next_attribute;
    if (cannot_add(WI, A)) {
        if (consistent(WI)) {
            store(WI);
        }
        WI = empty_instance;
    }
    add(WI, A);
}
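The loop can be made concrete with a short Python sketch. The two checks are stand-ins: here an attribute cannot be added if the working instance already has a value for it, and an instance counts as consistent if it has both a name and a price. Both rules, the `REQUIRED` set and the sample data are assumptions for illustration; in the system described here such constraints would come from the presentation ontology. The final flush after the loop is also an assumption, added so that the last instance is not lost.

```python
# Assumed minimal constraint: a product offer needs a name and a price.
REQUIRED = {"name", "price"}


def cannot_add(instance, attribute):
    """Assumed rule: each attribute may occur at most once per instance."""
    name, _value = attribute
    return name in instance


def consistent(instance):
    """Assumed rule: all required attributes must be present."""
    return REQUIRED.issubset(instance)


def extract_instances(attributes):
    """Group a sequence of (name, value) annotations into instances."""
    stored = []
    working = {}  # the working instance WI
    for attribute in attributes:
        if cannot_add(working, attribute):
            # Close the current instance; keep it only if it is consistent.
            if consistent(working):
                stored.append(working)
            working = {}
        name, value = attribute
        working[name] = value
    # Assumed final step: store the last working instance if consistent.
    if consistent(working):
        stored.append(working)
    return stored


# Invented annotation sequence, in document order.
annotated = [
    ("name", "Trek 4300"), ("price", "399"),
    ("name", "Author Basic"),               # no price follows: discarded
    ("name", "Scott Aspect"), ("price", "450"),
]
for instance in extract_instances(annotated):
    print(instance)
```

In this run the second name has no price before the next name arrives, so its working instance is dropped as inconsistent and two offers are stored.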


Search interface powered by Sesame


Future work

• Learn to correct annotation errors
  – use document structure to detect unlabeled attributes
  – bootstrap from these new examples
  – use ontology constraints on values (types, lists, regexps)
• Population algorithm
  – utilize scores for each annotated attribute
  – augment presentation ontology with frequencies of attribute orderings
  – use approximate name matching to identify instances
• Improve search interface
  – approximate name matching (word and char edit distance)
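Word-level and character-level edit distance can share one implementation, since the standard Levenshtein recurrence works on any sequence. The sketch below is one possible reading of "word and char edit distance"; the function names, the lowercasing step and the product names are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x
                curr[j - 1] + 1,     # insert y
                prev[j - 1] + cost,  # substitute x by y, or match
            ))
        prev = curr
    return prev[-1]


def char_distance(a, b):
    """Edit distance over characters, ignoring case."""
    return edit_distance(a.lower(), b.lower())


def word_distance(a, b):
    """Edit distance over whitespace-separated words, ignoring case."""
    return edit_distance(a.lower().split(), b.lower().split())


print(char_distance("Trek 4300", "Trek 4500"))
print(word_distance("Trek 4300 Disc", "trek 4300"))
```

Character distance tolerates typos inside a word, while word distance tolerates missing or extra words in a product name; a search interface could combine both.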


Thank you!

rainbow.vse.cz
