KOM – Multimedia Communications Lab
Prof. Dr.-Ing. Ralf Steinmetz (Director)
Dept. of Electrical Engineering and Information Technology
Dept. of Computer Science (adjunct professor)
TU Darmstadt – Technische Universität Darmstadt, Rundeturmstr. 10, D-64283 Darmstadt, Germany
Tel. +49 6151 166150, Fax +49 6151 166152, www.KOM.tu-darmstadt.de
© 2010 author(s) of these slides, including research results from the KOM research network and TU Darmstadt. Exceptions are noted on the respective slide.
Dipl.-Inform. Philipp Scholl, Doreen Böhnstedt, M.Sc., Dipl.-Inform. Renato Domínguez García, Dr.-Ing. Christoph Rensing, Prof. Dr.-Ing. Ralf Steinmetz
Philipp.Scholl@KOM.tu-darmstadt.de, Tel. +49 6151 166115
2010-10-01 EC-TEL Presentation Scholl.ppt
Extended Explicit Semantic Analysis for Calculating Semantic Relatedness of Web Resources
Presentation 2010/10/01, EC-TEL, Barcelona
KOM – Multimedia Communications Lab 2
Outline
- A Learning Scenario – Knowledge Networks and Snippets
- Measuring Semantic Relatedness with ESA
- Proposed Enhancements to ESA
- Evaluation
- Conclusions & Outlook
Scenario: Crokodil
Crokodil – supporting resource-based learning with web resources
- Collecting fragments of web resources ("snippets")
- Organizing snippets via (semantic) tagging, with types Person, Event, Goal, Location, …
- Underlying structure: personal and community knowledge networks
- Embedded as an add-on into the sidebar of the Firefox web browser
Study Results: Snippets of Web Resources
Participants of the study [SBB09] found saving fragments of web resources (instead of whole web pages) very useful.
Snippets ≡ fragments of web resources:
- Definite, narrow topic scope
- Cover the user's information needs
Findings of the study [SBB09], comparing 1357 snippets vs. 705 web resources:
- Snippets: 70% are shorter than 100 words
- Web resources: 70% are shorter than 1000 words
[Figure: "Comparison: Snippets vs. HTML Pages" – cumulative percentage over size in words / tokens (log scale, 1 to 100000), for snippets vs. complete HTML pages]
[SBB09] Scholl, P., Benz, B. F., Böhnstedt, D., Rensing, C., Schmitz, B., Steinmetz, R. (2009): Implementation and Evaluation of a Tool for Setting Goals in Self-Regulated Learning with Web Resources, In: Learning in the Synergy of Multiple Disciplines, EC-TEL 2009, pp. 521-534, Springer-Verlag Berlin/Heidelberg
Structural Recommendations
Suggesting related resources in Crokodil, based on the structure of the knowledge network:
- Whether the resource has already been saved in the personal or community knowledge network
- Based on explicit connections between the current web resource and tags
[Figure: blog entry "Visualization of Learning with Web 2.0" and paper excerpt "Social Network Analysis and Visualizations for Learning", connected via the tags Web 2.0, Lifelong Learning, EC-TEL 2010, E-Learning, and TEL; arrow labeled "Recommendation"]
Challenge: Sparse Knowledge Networks
- Direct, explicit connections do not always exist
- Knowledge networks are sparse
- Goal: semantic recommendation based on snippets
- Some measure of similarity / relatedness between snippets is needed for recommendation
[Figure: blog entry "E-Learning in Web 2.0" and paper excerpt "Web 2.0 for Learning" with tags Web 2.0, Lifelong Learning, TEL, E-Learning, but without a connecting path; the possible recommendation is marked with "?"]
Implications for Recommending Snippets
Snippets:
- Are mostly short
- Have only few significant terms
- The learning scenario calls for recommending related, not necessarily similar, snippets (semantic relatedness vs. semantic similarity)
Challenge: vocabulary gap
- Different wording and terminology
- Only marginally similar in terminology, but semantically strongly related
- Example: "TEL refers to the assistance of activities in knowledge acquisition through technology" vs. "E-Learning comprises all forms of electronically supported learning and teaching."
- A naïve bag-of-words approach is not feasible for the comparison
One approach that accommodates these properties: Explicit Semantic Analysis
Base Approach: Explicit Semantic Analysis (ESA)
- Calculates relatedness between words / texts [GM07]
- Based on a reference corpus containing n semantically distinct documents
- Allows comparison between conceptualized abstractions of documents
- The semantic interpretation matrix M_int (n × |terms|) is built from the n corpus documents
- Preprocessing steps: 1. tokenization, 2. stemming, 3. calculation of TF-IDF
- A document d is represented as a preprocessed term vector (|terms| × 1); its semantic interpretation vector is i_esa = M_int · d (n × 1)
- The resulting semantic interpretation vectors i_esa of two documents d1, d2 can be compared (e.g. by the cosine measure)

[GM07] Gabrilovich, E. & Markovitch, S. (2007): Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 6-12
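The ESA pipeline above (tokenization, TF-IDF weighting, projection onto the concept space, cosine comparison) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names are invented, whitespace tokenization stands in for real tokenization and stemming, and a toy corpus replaces a Wikipedia dump.

```python
import math

def tfidf_matrix(corpus):
    """Build a term -> concept-weight mapping (the interpretation matrix
    M_int) from a reference corpus of n semantically distinct documents."""
    n = len(corpus)
    tokenized = [doc.lower().split() for doc in corpus]
    df = {}                                   # document frequency per term
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    m = {}                                    # term -> list of n concept weights
    for j, tokens in enumerate(tokenized):
        for term in tokens:
            weights = m.setdefault(term, [0.0] * n)
            weights[j] += (1.0 / len(tokens)) * math.log(n / df[term])  # TF-IDF
    return m

def interpret(text, m, n):
    """Project a document onto the concept space: i_esa = M_int * d.
    Terms unknown to the corpus contribute nothing."""
    vec = [0.0] * n
    for term in text.lower().split():
        for j, w in enumerate(m.get(term, [0.0] * n)):
            vec[j] += w
    return vec

def cosine(a, b):
    """Compare two interpretation vectors by the cosine measure."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

Two snippets that share no surface terms can still score as related if their terms activate the same concepts, which is exactly what bridges the vocabulary gap discussed earlier.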
Wikipedia as Reference Corpus
ESA commonly uses Wikipedia as a feasible reference corpus.
Wikipedia: collaboratively edited encyclopedic knowledge
- Each article corresponds to a semantic concept (topic); German Wikipedia: 1 million articles
- Articles are densely interconnected by wiki links; German Wikipedia: 25 million links
- Articles are semantically grouped into categories; German Wikipedia: 122k categories
- Articles are connected to corresponding / similar articles in other languages (266 languages available)
(Source: wikipedia.org)
Observation and Hypothesis
Observation: ESA considers only the article text and ignores semantic information contained in Wikipedia that could be used:
- Connectivity by links
- Category information
Approach: implement different enhancements by semantic enrichment – eXtended Explicit Semantic Analysis (XESA)
Hypothesis: semantically enriching the interpretation vector with this additional information, readily provided by Wikipedia, improves the task of comparing snippets.
XESA – Overview
- ESA: article content only
- XESA_AG: article content + article graph
- XESA_CAT: article content + category information
- XESA_AG+CAT: article content + article graph + category information
Article Graph Extension
The semantic interpretation vector i_esa (|articles| × 1) is multiplied by the article graph matrix A (|articles| × |articles|), built from the wiki links between articles: i_esa_AG = A · i_esa (|articles| × 1).
Additional factors (not shown here):
- Article-link weight w_link-weight – determines the weight of the article graph
- selectBestN – selection of only the n best values of i_esa, for complexity reduction
[Figure: article "Albert Einstein" linked to Gravitation, Space, Matter, Curvature, Black Hole, General Relativity, Catholic School, Jewish, Ulm]
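A minimal sketch of the article-graph step: prune the interpretation vector to its n strongest activations (selectBestN), then propagate activation along article links weighted by w_link-weight. The slides name these two parameters but not the exact combination formula, so the update rule used here (adding link-weighted mass to the pruned base vector) is an assumption for illustration.

```python
def extend_article_graph(i_esa, links, w_link=0.5, best_n=25):
    """XESA_AG sketch (combination rule assumed, not from the slides):
    keep the best_n strongest activations of i_esa, then add w_link
    times the activation of each source article to its link targets,
    i.e. roughly i_esa_AG = i' + w_link * A * i' with i' the pruned vector."""
    n = len(i_esa)
    # selectBestN: zero out everything but the n strongest activations
    keep = set(sorted(range(n), key=lambda j: i_esa[j], reverse=True)[:best_n])
    pruned = [i_esa[j] if j in keep else 0.0 for j in range(n)]
    out = pruned[:]
    # links: adjacency list, source article index -> linked article indices
    for src, targets in links.items():
        for tgt in targets:
            out[tgt] += w_link * pruned[src]
    return out
```

In a real system A would be a sparse |articles| × |articles| matrix derived from the 25 million wiki links, which is why the selectBestN pruning matters for complexity.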
Category Graph Extension
The categories are appended to the concept space, so the resulting interpretation vector has more dimensions: i_esa_CAT = A_cat · i_esa, with A_cat of size (|cat+art| × |art|), i_esa of size (|art| × 1), and i_esa_CAT of size (|cat+art| × 1).
[Figure: articles such as General Relativity, Misner Space, Anti-Gravity, Atom, Heat grouped into categories such as Fundamental Physics Concepts, Concepts of Heaven, Relativity, Theories of Gravitation, Physics Concepts by Field, Frames of Reference, General Relativity]
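The dimension expansion can be sketched as follows. The slides give only the shape of A_cat, so the aggregation rule (each category dimension summing the activations of its member articles) is an assumed, illustrative choice:

```python
def extend_categories(i_esa, article_cats, n_cats):
    """XESA_CAT sketch: append |cat| category dimensions to the
    |art|-dimensional vector, i.e. i_esa_CAT = A_cat * i_esa with A_cat
    of shape (|cat| + |art|) x |art|. Each category dimension aggregates
    the activation of its member articles (assumed aggregation rule)."""
    cat_part = [0.0] * n_cats
    for art, cats in article_cats.items():   # article index -> category indices
        for c in cats:
            cat_part[c] += i_esa[art]
    return list(i_esa) + cat_part            # (|cat| + |art|)-dimensional result
```

Because the article part of the vector is kept unchanged and only new dimensions are added, two documents that were related via article concepts can additionally match via shared categories.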
Evaluation: Development of an Own Corpus
12 participants were asked to answer questions with snippets.
Task: find snippets answering 10 different questions in 5 flavors:
- Facts ("What is FTAA?")
- Opinions ("Is the term 'dark ages' justifiable?")
- Homonyms ("What is Java?") – with sub-groups where the meaning is ambiguous (e.g. the Java programming language vs. the Indonesian island Java)
- Loosely coupled topics ("How are sweets produced?")
- Wide topics ("What is the origin of the human race?")
Different search engines were used (Google, Bing, Yahoo!, …), resulting in 282 distinct snippets.
Note: the created corpus corresponds to our definition of snippets – on average 95 terms, min 5, max 756, standard deviation 71.3.
Evaluation: Methodology
Evaluation: ESA vs. XESA. As we do not have pair comparisons for all snippets, the rank is important: relevant and similar snippets should be delivered first.
Evaluation methodology: break-even point, as used for search engines
- Definition: the break-even point is the point where precision and recall of a query are equal; the higher, the better
- The average interpolated precision is averaged over the comparisons of all snippets
- Results are displayed as a precision-recall diagram
Baseline ESA: break-even point at 0.595
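For a single ranked result list, the break-even point can be computed as R-precision: if the list contains R relevant items in total, then at rank R precision (k/R relevant retrieved out of R retrieved) and recall (k/R relevant retrieved out of R relevant) coincide. A small sketch, with an invented helper name:

```python
def break_even_point(ranked_relevance):
    """Break-even point of one ranked result list (1 = relevant, 0 = not).
    Equals the precision within the top-R results, where R is the total
    number of relevant items, since precision and recall are equal there."""
    r = sum(ranked_relevance)
    if r == 0:
        return 0.0                       # no relevant items: degenerate case
    return sum(ranked_relevance[:r]) / r
```

The per-query values are then averaged over all snippet comparisons to obtain figures such as the 0.595 ESA baseline.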
Evaluation: Comparing Approaches
Selected parameters (adjusted experimentally):
- selectBestN: n = 25
- Article-link weight: w ∈ {0.5, 0.75} does not make a significant difference
Results (break-even points):
- Best results for XESA_AG(B) (0.643), but no significant difference from XESA_AG(A) (0.641) – about 9% better than ESA
- XESA_CAT (0.620) is good, but cannot catch up
- XESA_AG+CAT (0.543) performs worse than ESA
Recommending via Semantic Relatedness
[Figure: sparse knowledge networks of web pages; blog entry "Visualization of Learning with Web 2.0" and paper excerpt "Social Network Analysis and Visualizations for Learning", tagged Web 2.0, Lifelong Learning, E-Learning, TEL; the recommendation link is now derived from the semantic relatedness computed by XESA]
Conclusions and Future Work
- Using Wikipedia as a reference corpus for calculating semantic relatedness of snippets is feasible
- Enhancing ESA by integrating Wikipedia's rich semantic structure yields better results; the article graph improves ESA by up to 9%
- Performance: not yet applicable to online scenarios
Future Work:
- Next step: implement semantic relatedness in recommendations
- Coping with large datasets: make the approach perform well in real-life contexts
- Calculate a cut-off for "good" concept terms (dimension reduction)
- Measuring similarity between documents in different languages
Questions?
…Thank you for your attention!
This work was supported by funds from the German Federal Ministry of
Education and Research under the mark 01 PF 08015 A and from the European
Social Fund of the European Union (ESF).