Machine Learning for the Semantic Web, Feb 14th 2005
Information extraction from HTML product catalogues
Martin Labský¹, Vojtěch Svátek¹, Pavel Praks², Ondřej Šváb¹
{labsky, svatek, xsvao06}@vse.cz, [email protected]
rainbow.vse.cz
¹ Dept. of Information and Knowledge Engineering, Prague University of Economics
² Dept. of Applied Mathematics, Technical University of Ostrava
Coupling quantitative and knowledge-based approaches
Agenda
• Overview of the Rainbow project
• Extraction of product offers
  – Annotation using HMMs
  – Impact of image information
  – Ontology-based instance extraction
  – Search interface
• Future work
Rainbow overview
• Goal
  – to present the content and structure of legacy websites to a user or computer agent
• How
  – multiway analysis of websites: utilize features derived from text, images, formatting, URLs, navigation structure and background knowledge
• Modular architecture, web services
  – information extraction (HMMs)
  – discovery of website navigation structure (link graph)
  – image classifiers (histograms, dimensions, similarity)
  – URL classifier (rule-based)
  – extractor of summarizing sentences (bootstrapped indicator keywords)
Application of Rainbow
Extraction of product offers
• Combines
  – automatic document annotation using HMMs
  – image classifier
  – ontology-based instance composition
  – URL classifier for focused crawling
  – structured search interface powered by Sesame
• The data
  – over 1000 bicycle offers (labeled using 15 attributes)
  – in 100 pages from different websites
Sample data
Preprocessing
• HTML cleanup
  – conversion to valid XHTML
• Only potentially relevant blocks kept
  – blocks that do not directly contain text or images omitted
• Formatting tags
  – attributes removed
  – several rules matching common constructions (add-to-basket form, choose-amount button)
• Images
  – baseline: all images treated as a single token
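A minimal sketch of two of the steps above (dropping tag attributes and mapping every image to one token), using only Python's standard `html.parser`. The token name `<IMG>`, the class `CatalogueTokenizer` and the function `tokenize` are hypothetical; the XHTML conversion, block filtering and construction-matching rules are not covered.

```python
from html.parser import HTMLParser

IMAGE_TOKEN = "<IMG>"  # hypothetical name; the slides only say "a single token"


class CatalogueTokenizer(HTMLParser):
    """Flattens HTML into tokens: attribute-free tags, words, image tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped; every image becomes the same token.
        self.tokens.append(IMAGE_TOKEN if tag == "img" else f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img/> produce one token, no end tag.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag != "img":
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Whitespace-separated words; pure-whitespace text yields nothing.
        self.tokens.extend(data.split())


def tokenize(html):
    parser = CatalogueTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    print(tokenize('<td class="x"><b>Trek 4300</b> <img src="a.jpg"/> 499 EUR</td>'))
```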
Annotation using HMMs
• HMM structure
  – target, prefix, suffix and background states
  – adopted from [Freitag, McCallum 99]
• Single tag trigram model for all tags
• F-measures
  – 83% for name, 89% for price
  – 56% average for 13 other attributes (17-90%)
• Variations
  – word-ngram models for lexical probabilities of target states
  – state substructures instead of single target states, learned by EM
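To illustrate how a model with target, prefix, suffix and background states labels a token sequence, here is a small Viterbi decoding sketch. It is a simplification under stated assumptions: it is first-order (the slides mention a tag trigram model), it handles one attribute only, and all probabilities in `START`, `TRANS` and `EMIT` are made-up toy values, not trained parameters.

```python
import math

STATES = ["background", "prefix", "target", "suffix"]

# Toy parameters, invented for illustration only.
START = {"background": 1.0}
TRANS = {
    "background": {"background": 0.8, "prefix": 0.2},
    "prefix": {"target": 1.0},
    "target": {"target": 0.5, "suffix": 0.5},
    "suffix": {"background": 1.0},
}
EMIT = {
    "background": {"<td>": 0.4, "</td>": 0.4, "bike": 0.2},
    "prefix": {"Price:": 1.0},
    "target": {"499": 0.5, "EUR": 0.5},
    "suffix": {"</td>": 1.0},
}
FLOOR = 1e-6  # assumed score for tokens a state has never emitted


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def viterbi(tokens):
    """Returns the most probable state sequence for the tokens."""
    if not tokens:
        return []
    # scores[s]: best log-probability of any path that ends in state s
    scores = {s: _log(START.get(s, 0.0)) + _log(EMIT[s].get(tokens[0], FLOOR))
              for s in STATES}
    backpointers = []
    for token in tokens[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES,
                            key=lambda p: scores[p] + _log(TRANS[p].get(s, 0.0)))
            new_scores[s] = (scores[best_prev]
                             + _log(TRANS[best_prev].get(s, 0.0))
                             + _log(EMIT[s].get(token, FLOOR)))
            pointers[s] = best_prev
        backpointers.append(pointers)
        scores = new_scores
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi(["<td>", "Price:", "499", "EUR", "</td>"]))
```

Tokens labeled `target` would be annotated as the attribute value; `prefix` and `suffix` model the context immediately around it.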
Impact of image information
• Image classifier
  – classifies into 3 classes: Pos, Neg, Unk
  – before HMM annotation, each image occurrence in a document is substituted by its class
  – best result 6.6% error rate for binary classification with multi-layer perceptron (Weka)
• Features used for classification
  – dimensions (estimated 2-dimensional normal distribution)
  – similarity (latent semantic similarity [Praks 2004])
  – whether the same image repeats in the same document
• Results
  – image precision increased by 19.1%, recall by 2%
  – improvements for other tags negligible
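The substitution step (each image occurrence replaced by its class before HMM annotation) might look roughly like the sketch below. The function `classify_image` is a hypothetical stand-in: it applies a crude size rule instead of the trained classifier and its features, and the token format `("img", src)` and the `IMG_` prefix are also assumptions.

```python
def classify_image(width, height):
    """Hypothetical stand-in for the trained image classifier."""
    if not width or not height:
        return "Unk"  # size unknown: no decision
    # Assumed rule: product photos are reasonably large and not too elongated.
    aspect = width / height
    if width >= 80 and height >= 80 and 0.5 <= aspect <= 2.0:
        return "Pos"
    return "Neg"


def substitute_images(tokens, image_sizes):
    """Replaces each ("img", src) token by an IMG_<class> token."""
    result = []
    for token in tokens:
        if isinstance(token, tuple) and token[0] == "img":
            width, height = image_sizes.get(token[1], (None, None))
            result.append("IMG_" + classify_image(width, height))
        else:
            result.append(token)
    return result


if __name__ == "__main__":
    tokens = ["<td>", ("img", "bike.jpg"), "Trek", ("img", "spacer.gif")]
    sizes = {"bike.jpg": (200, 150), "spacer.gif": (1, 1)}
    print(substitute_images(tokens, sizes))
```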
Ontology-based instance extraction
[Diagram components: document annotated by HMM, presentation ontology, instance extraction algorithm, instances (XML), Sesame RDF repository]
[Diagrams: domain ontology, presentation ontology]
Instance extraction algorithm
• Sequentially parses the annotated document
• Adds annotated attributes to the working instance WI
• If adding an attribute would cause an inconsistency, a new empty working instance is created; the old working instance is stored only if it is consistent

WI = empty_instance;
while (more_attributes) {
    A = next_attribute;
    if (cannot_add (WI, A)) {
        if (consistent (WI)) {
            store (WI);
        }
        WI = empty_instance;
    }
    add (WI, A);
}
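The loop above can be sketched in Python as follows. The concrete rules are assumptions, since the slides do not spell out the ontology constraints: here `cannot_add` rejects a second value for an attribute already present, and `consistent` requires the attributes in `REQUIRED`. The final check after the loop is also an addition not shown in the pseudocode; without it, the last instance in a document would never be stored.

```python
REQUIRED = {"name", "price"}  # assumed consistency rule, not from the slides


def cannot_add(instance, attribute):
    """Assumed rule: each attribute may occur at most once per instance."""
    name, _value = attribute
    return name in instance


def consistent(instance):
    """Assumed rule: an instance needs all REQUIRED attributes."""
    return REQUIRED.issubset(instance)


def extract_instances(attributes):
    """Groups a sequence of (name, value) annotations into instances."""
    stored = []
    working = {}
    for attribute in attributes:
        if cannot_add(working, attribute):
            if consistent(working):
                stored.append(working)
            working = {}
        name, value = attribute
        working[name] = value
    # Assumption: also store the last working instance if it is consistent.
    if consistent(working):
        stored.append(working)
    return stored


if __name__ == "__main__":
    annotations = [("name", "Trek 4300"), ("price", "499"),
                   ("name", "Scott Aspect"), ("price", "399"),
                   ("price", "450")]
    for instance in extract_instances(annotations):
        print(instance)
```

In this example the trailing `("price", "450")` starts a new working instance that lacks a name, so it is discarded as inconsistent.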
Search interface powered by Sesame
Future work
• Learn to correct annotation errors
  – use document structure to detect unlabeled attributes
  – bootstrap from these new examples
  – use ontology constraints on values (types, lists, regexps)
• Population algorithm
  – utilize scores for each annotated attribute
  – augment presentation ontology with frequencies of attribute orderings
  – use approximate name matching to identify instances
• Improve search interface
  – approximate name matching (word and char edit distance)
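As an illustration of word-level and character-level edit distance for approximate name matching, here is a minimal Levenshtein sketch. The helper names and the lowercasing step are assumptions; the slides do not say how the two distances would be combined or thresholded.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_distance(x, y):
    """Edit distance counted in characters, ignoring case."""
    return edit_distance(x.lower(), y.lower())


def word_distance(x, y):
    """Edit distance counted in whole words, ignoring case."""
    return edit_distance(x.lower().split(), y.lower().split())


if __name__ == "__main__":
    print(char_distance("Trek 4300", "Trek 4330"))
    print(word_distance("Trek 4300 Disc", "trek 4300"))
```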
Thank you!
rainbow.vse.cz