KOM – Multimedia Communications Lab
Prof. Dr.-Ing. Ralf Steinmetz (Director)
Dept. of Electrical Engineering and Information Technology
Dept. of Computer Science (adjunct professor)
TU Darmstadt – Technische Universität Darmstadt, Rundeturmstr. 10, D-64283 Darmstadt, Germany
Tel. +49 6151 166150, Fax +49 6151 166152, www.KOM.tu-darmstadt.de
© 2010 author(s) of these slides, including research results from the KOM research network and TU Darmstadt. Exceptions are noted on the respective slide.
Dipl.-Inform. Philipp Scholl, Doreen Böhnstedt, M.Sc., Dipl.-Inform. Renato Domínguez García, Dr.-Ing. Christoph Rensing, Prof. Dr.-Ing. Ralf Steinmetz
Philipp.Scholl@KOM.tu-darmstadt.de, Tel. +49 6151 166115
2010-10-01 EC-TEL Presentation Scholl.ppt
Extended Explicit Semantic Analysis for Calculating Semantic Relatedness of Web Resources
Presentation 2010/10/01, EC-TEL, Barcelona
KOM – Multimedia Communications Lab 2
Outline
- A Learning Scenario – Knowledge Networks and Snippets
- Measuring Semantic Relatedness with ESA
- Proposed Enhancements to ESA
- Evaluation
- Conclusions & Outlook
Scenario: Crokodil
Crokodil – supporting resource-based learning with web resources
- Collecting fragments of web resources ("snippets")
- Organizing snippets via (semantic) tagging, with types Person, Event, Goal, Location, …
- Underlying structure: personal and community knowledge networks
- Embedded as an add-on into the sidebar of the Firefox web browser
Study Results: Snippets of Web Resources
Participants of the study [SBB09] found saving fragments of web resources (instead of whole web pages) very useful.
Snippets ≡ fragments of web resources:
- Definite, narrow topic scope
- Cover the user's information needs
Findings of the study [SBB09], comparing 1357 snippets vs. 705 web resources:
- Snippets: 70% are shorter than 100 words
- Web resources: 70% are shorter than 1000 words
[Figure: "Comparison: Snippets vs. HTML Pages" – cumulative percentage over size in words / tokens (log scale, 1 to 100000), for snippets vs. complete HTML pages]
[SBB09] Scholl, P., Benz, B. F., Böhnstedt, D., Rensing, C., Schmitz, B., Steinmetz, R. (2009): Implementation and Evaluation of a Tool for Setting Goals in Self-Regulated Learning with Web Resources, In: Learning in the Synergy of Multiple Disciplines, EC-TEL 2009, pp. 521-534, Springer-Verlag Berlin/Heidelberg
Structural Recommendations
Suggesting related resources in Crokodil, based on the structure of the knowledge network:
- Whether the resource has already been saved in the personal or community knowledge network
- Based on explicit connections between the current web resource and tags
[Figure: blog entry "Visualization of Learning with Web 2.0" and paper excerpt "Social Network Analysis and Visualizations for Learning", connected via the tags Web 2.0, Lifelong Learning, EC-TEL 2010, E-Learning, and TEL; arrow labeled "Recommendation"]
Challenge: Sparse Knowledge Networks
- Direct, explicit connections do not always exist
- Knowledge networks are sparse
- Goal: semantic recommendation based on snippets
- Some measure of similarity / relatedness between snippets is needed for recommendation
[Figure: blog entry "E-Learning in Web 2.0" and paper excerpt "Web 2.0 for Learning" with tags Web 2.0, Lifelong Learning, TEL, E-Learning, but without a connecting path; the possible recommendation is marked with "?"]
Implications for Recommending Snippets
Snippets:
- Are mostly short
- Have only few significant terms
- The learning scenario calls for recommending related, not necessarily similar, snippets (semantic relatedness vs. semantic similarity)
Challenge: vocabulary gap
- Different wording and terminology
- Only marginally similar in terminology, but semantically strongly related
- Example: "TEL refers to the assistance of activities in knowledge acquisition through technology" vs. "E-Learning comprises all forms of electronically supported learning and teaching."
- A naïve bag-of-words approach is not feasible for the comparison
One approach that accommodates these properties: Explicit Semantic Analysis
Base Approach: Explicit Semantic Analysis (ESA)
- Calculates relatedness between words / texts [GM07]
- Based on a reference corpus containing n semantically distinct documents
- Allows comparison between conceptualized abstractions of documents
- The semantic interpretation matrix M_int (n × |terms|) is built from the n corpus documents
- Preprocessing steps: 1. tokenization, 2. stemming, 3. calculation of TF-IDF
- A document d is represented as a preprocessed term vector (|terms| × 1); its semantic interpretation vector is i_esa = M_int · d (n × 1)
- The resulting semantic interpretation vectors i_esa of two documents d1, d2 can be compared (e.g. by the cosine measure)

[GM07] Gabrilovich, E. & Markovitch, S. (2007): Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pp. 6-12
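The ESA pipeline above (tokenization, TF-IDF weighting, projection onto the concept space, cosine comparison) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names are invented, whitespace tokenization stands in for real tokenization and stemming, and a toy corpus replaces a Wikipedia dump.

```python
import math

def tfidf_matrix(corpus):
    """Build a term -> concept-weight mapping (the interpretation matrix
    M_int) from a reference corpus of n semantically distinct documents."""
    n = len(corpus)
    tokenized = [doc.lower().split() for doc in corpus]
    df = {}                                   # document frequency per term
    for tokens in tokenized:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    m = {}                                    # term -> list of n concept weights
    for j, tokens in enumerate(tokenized):
        for term in tokens:
            weights = m.setdefault(term, [0.0] * n)
            weights[j] += (1.0 / len(tokens)) * math.log(n / df[term])  # TF-IDF
    return m

def interpret(text, m, n):
    """Project a document onto the concept space: i_esa = M_int * d.
    Terms unknown to the corpus contribute nothing."""
    vec = [0.0] * n
    for term in text.lower().split():
        for j, w in enumerate(m.get(term, [0.0] * n)):
            vec[j] += w
    return vec

def cosine(a, b):
    """Compare two interpretation vectors by the cosine measure."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0
```

Two snippets that share no surface terms can still score as related if their terms activate the same concepts, which is exactly what bridges the vocabulary gap discussed earlier.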
Wikipedia as Reference Corpus
ESA commonly uses Wikipedia as a feasible reference corpus.
Wikipedia: collaboratively edited encyclopedic knowledge
- Each article corresponds to a semantic concept (topic); German Wikipedia: 1 million articles
- Articles are densely interconnected by wiki links; German Wikipedia: 25 million links
- Articles are semantically grouped into categories; German Wikipedia: 122k categories
- Articles are connected to corresponding / similar articles in other languages (266 languages available)
(Source: wikipedia.org)
Observation and Hypothesis
Observation: ESA considers only the article text and ignores semantic information contained in Wikipedia that could be used:
- Connectivity by links
- Category information
Approach: implement different enhancements by semantic enrichment – eXtended Explicit Semantic Analysis (XESA)
Hypothesis: semantically enriching the interpretation vector with this additional information, readily provided by Wikipedia, improves the task of comparing snippets.
XESA – Overview
- ESA: article content only
- XESA_AG: article content + article graph
- XESA_CAT: article content + category information
- XESA_AG+CAT: article content + article graph + category information
Article Graph Extension
The semantic interpretation vector i_esa (|articles| × 1) is multiplied by the article graph matrix A (|articles| × |articles|), built from the wiki links between articles: i_esa_AG = A · i_esa (|articles| × 1).
Additional factors (not shown here):
- Article-link weight w_link-weight – determines the weight of the article graph
- selectBestN – selection of only the n best values of i_esa, for complexity reduction
[Figure: article "Albert Einstein" linked to Gravitation, Space, Matter, Curvature, Black Hole, General Relativity, Catholic School, Jewish, Ulm]
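A minimal sketch of the article-graph step: prune the interpretation vector to its n strongest activations (selectBestN), then propagate activation along article links weighted by w_link-weight. The slides name these two parameters but not the exact combination formula, so the update rule used here (adding link-weighted mass to the pruned base vector) is an assumption for illustration.

```python
def extend_article_graph(i_esa, links, w_link=0.5, best_n=25):
    """XESA_AG sketch (combination rule assumed, not from the slides):
    keep the best_n strongest activations of i_esa, then add w_link
    times the activation of each source article to its link targets,
    i.e. roughly i_esa_AG = i' + w_link * A * i' with i' the pruned vector."""
    n = len(i_esa)
    # selectBestN: zero out everything but the n strongest activations
    keep = set(sorted(range(n), key=lambda j: i_esa[j], reverse=True)[:best_n])
    pruned = [i_esa[j] if j in keep else 0.0 for j in range(n)]
    out = pruned[:]
    # links: adjacency list, source article index -> linked article indices
    for src, targets in links.items():
        for tgt in targets:
            out[tgt] += w_link * pruned[src]
    return out
```

In a real system A would be a sparse |articles| × |articles| matrix derived from the 25 million wiki links, which is why the selectBestN pruning matters for complexity.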
Category Graph Extension
The categories are appended to the concept space, so the resulting interpretation vector has more dimensions: i_esa_CAT = A_cat · i_esa, with A_cat of size (|cat+art| × |art|), i_esa of size (|art| × 1), and i_esa_CAT of size (|cat+art| × 1).
[Figure: articles such as General Relativity, Misner Space, Anti-Gravity, Atom, Heat grouped into categories such as Fundamental Physics Concepts, Concepts of Heaven, Relativity, Theories of Gravitation, Physics Concepts by Field, Frames of Reference, General Relativity]
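The dimension expansion can be sketched as follows. The slides give only the shape of A_cat, so the aggregation rule (each category dimension summing the activations of its member articles) is an assumed, illustrative choice:

```python
def extend_categories(i_esa, article_cats, n_cats):
    """XESA_CAT sketch: append |cat| category dimensions to the
    |art|-dimensional vector, i.e. i_esa_CAT = A_cat * i_esa with A_cat
    of shape (|cat| + |art|) x |art|. Each category dimension aggregates
    the activation of its member articles (assumed aggregation rule)."""
    cat_part = [0.0] * n_cats
    for art, cats in article_cats.items():   # article index -> category indices
        for c in cats:
            cat_part[c] += i_esa[art]
    return list(i_esa) + cat_part            # (|cat| + |art|)-dimensional result
```

Because the article part of the vector is kept unchanged and only new dimensions are added, two documents that were related via article concepts can additionally match via shared categories.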
Evaluation: Development of an Own Corpus
12 participants were asked to answer questions with snippets.
Task: find snippets answering 10 different questions in 5 flavors:
- Facts ("What is FTAA?")
- Opinions ("Is the term 'dark ages' justifiable?")
- Homonyms ("What is Java?") – with sub-groups where the meaning is ambiguous (e.g. the Java programming language vs. the Indonesian island Java)
- Loosely coupled topics ("How are sweets produced?")
- Wide topics ("What is the origin of the human race?")
Different search engines were used (Google, Bing, Yahoo!, …), resulting in 282 distinct snippets.
Note: the created corpus corresponds to our definition of snippets – on average 95 terms, min 5, max 756, standard deviation 71.3.
Evaluation: Methodology
Evaluation: ESA vs. XESA. As we do not have pair comparisons for all snippets, the rank is important: relevant and similar snippets should be delivered first.
Evaluation methodology: break-even point, as used for search engines
- Definition: the break-even point is the point where precision and recall of a query are equal; the higher, the better
- The average interpolated precision is averaged over the comparisons of all snippets
- Results are displayed as a precision-recall diagram
Baseline ESA: break-even point at 0.595
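For a single ranked result list, the break-even point can be computed as R-precision: if the list contains R relevant items in total, then at rank R precision (k/R relevant retrieved out of R retrieved) and recall (k/R relevant retrieved out of R relevant) coincide. A small sketch, with an invented helper name:

```python
def break_even_point(ranked_relevance):
    """Break-even point of one ranked result list (1 = relevant, 0 = not).
    Equals the precision within the top-R results, where R is the total
    number of relevant items, since precision and recall are equal there."""
    r = sum(ranked_relevance)
    if r == 0:
        return 0.0                       # no relevant items: degenerate case
    return sum(ranked_relevance[:r]) / r
```

The per-query values are then averaged over all snippet comparisons to obtain figures such as the 0.595 ESA baseline.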
Evaluation: Comparing Approaches
Selected parameters (adjusted experimentally):
- selectBestN: n = 25
- Article-link weight: w ∈ {0.5, 0.75} does not make a significant difference
Results (break-even points):
- Best results for XESA_AG(B) (0.643), but no significant difference from XESA_AG(A) (0.641) – about 9% better than ESA
- XESA_CAT (0.620) is good, but cannot catch up
- XESA_AG+CAT (0.543) performs worse than ESA
Recommending via Semantic Relatedness
[Figure: sparse knowledge networks of web pages; blog entry "Visualization of Learning with Web 2.0" and paper excerpt "Social Network Analysis and Visualizations for Learning", tagged Web 2.0, Lifelong Learning, E-Learning, TEL; the recommendation link is now derived from the semantic relatedness computed by XESA]
Conclusions and Future Work
- Using Wikipedia as a reference corpus for calculating semantic relatedness of snippets is feasible
- Enhancing ESA by integrating Wikipedia's rich semantic structure yields better results; the article graph improves ESA by up to 9%
- Performance: not yet applicable to online scenarios
Future Work:
- Next step: implement semantic relatedness in recommendations
- Coping with large datasets: make the approach perform well in real-life contexts
- Calculate a cut-off for "good" concept terms (dimension reduction)
- Measuring similarity between documents in different languages
Questions?
…Thank you for your attention!
This work was supported by funds from the German Federal Ministry of
Education and Research under the mark 01 PF 08015 A and from the European
Social Fund of the European Union (ESF).