Introduction The Method Evaluation Conclusion References
A Graph-Based Approach to Skill Extraction from Text
Higher School of Economics, School of Applied Mathematics and Information Science, Nizhny Novgorod, Russia
Ilkka Kivimäki (1), Alexander Panchenko (4,2), Adrien Dessy (1,2), Dries Verdegem (3), Pascal Francq (1), Cédrick Fairon (2), Hugues Bersini (3) and Marco Saerens (1)
(1) ICTEAM, (2) CENTAL, Université catholique de Louvain, Belgium; (3) IRIDIA, Université libre de Bruxelles, Belgium; (4) Digital Society Laboratory LLC, Russia
December 18, 2013
Table of Contents
1 Expertise retrieval and skill extraction
2 The Elisit system for skill extraction
  Overview of the system
  Sample Queries
  Association with Wikipedia
  Spreading activation in Wikipedia
3 Evaluation of system
4 Conclusion and future work
Reference paper:
Kivimäki I., Panchenko A., Dessy A., Verdegem D., Francq P., Bersini H. and Saerens M. “A Graph-Based Approach to Skill Extraction from Text”. In Proceedings of the 8th Workshop TextGraphs-8: Graph-based Methods for Natural Language Processing. EMNLP 2013: Conference on Empirical Methods in Natural Language Processing. Seattle, USA, October 18-21, 2013.
http://aclweb.org/anthology/W/W13/W13-5011.pdf
Expertise retrieval [Balog et al., 2012]
Expertise Retrieval vs. Expertise Seeking
Expertise retrieval: linking humans to expertise areas, and vice versa, from a system-centered perspective. Expertise retrieval has primarily focused on identifying good topical matches between a need for expertise on the one hand and the content of documents associated with candidate experts on the other hand.
Expertise seeking: linking humans to expertise areas from a human-centered perspective. Expertise seeking has been mainly investigated in the field of knowledge management, where the goal is to utilize human knowledge within an organization as well as possible.
Expertise retrieval: Expert Profiling vs. Expert Seeking
Person: a set of (text) documents generated by an individual.
Expertise: a keyword or a keyphrase specifying a field of knowledge, e.g. “Machine Learning”, “Hadoop”, “NLP”, etc.
Expert profiling: given a person, retrieve (profile) their expertise. Person → Expertise
Expert retrieval: given an expertise, retrieve persons with such expertise. Expertise → Person
Expertise Retrieval: Earlier Work
TREC Enterprise Track [Balog et al., 2008]
State-of-the-art overview [Balog et al., 2012]
A skill extraction system [Crow and DeSanto, 2004]
Skill extraction system [Skomoroch et al., 2012]
Expertise retrieval in universities [Balog et al., 2007]
Expert finding on DBLP data [Deng et al., 2008]
e-Human Resource Management system [Biesalski, 2003]
Skill extraction System [Skomoroch et al., 2012]
http://www.freepatentsonline.com/20120197863.pdf
Expertise Retrieval: Applications
Expertise management systems
  Knowledge management in enterprises
  Employee profiling
Reviewer selection for articles
Recommendation systems for
  jobs
  job applicants
  websites, blog texts, articles
Skill extraction
We focus on skill extraction from texts, i.e. associating skills with text documents.
Overview of the system
The Elisit system for skill extraction
Original goal of the system:
Associate professional skills to people based on texts that they produce (emails, blogs, forums, articles etc.).
Tools:
List of skills extracted from LinkedIn.
The skills are linked to corresponding Wikipedia pages.
Method:
1 Find Wikipedia pages relevant to a target document.
2 Use spreading activation on Wikipedia’s hyperlink network to find skills that are “close” or “central” to these relevant pages.
[Figure: Skill extraction using Wikipedia]
Size of the problem
Our current version of English Wikipedia consists of
n = 3 983 338 encyclopedia entries
m = 247 560 469 links
27 513 of the encyclopedia entries correspond to LinkedIn skills.
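A quick back-of-the-envelope calculation with these figures suggests why a sparse representation of the link matrix is the natural choice. This is only an illustrative sketch: the 8-byte sizes for weights and indices are assumptions, not measurements of the Elisit system.

```python
# Figures quoted on the slide.
n_pages = 3_983_338      # encyclopedia entries (nodes)
n_links = 247_560_469    # hyperlinks (edges)
n_skills = 27_513        # entries that correspond to LinkedIn skills

avg_links_per_page = n_links / n_pages   # mean number of links per page
density = n_links / (n_pages ** 2)       # fraction of non-zero entries in W
skill_share = n_skills / n_pages         # fraction of pages that are skills

# Assumption: 8 bytes per stored weight. For the sparse layout we assume
# one 8-byte value plus one 8-byte column index per link.
dense_bytes = n_pages ** 2 * 8
sparse_bytes = n_links * (8 + 8)

print(f"average links per page: {avg_links_per_page:.1f}")
print(f"link matrix density:    {density:.2e}")
print(f"share of skill pages:   {skill_share:.2%}")
print(f"dense matrix:  {dense_bytes / 1e12:.0f} TB")
print(f"sparse matrix: {sparse_bytes / 1e9:.1f} GB")
```

Under these assumptions a dense matrix is far beyond main memory, while a sparse one fits on a single machine.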
Implementation
For computing the similarities between the target document and all Wikipedia pages, we use the Gensim library [Rehurek and Sojka, 2010].
This part of the Elisit system is called the text2wiki module. It is currently the bottleneck of the computation.
For performing spreading activation, we use the sparse matrix library of SciPy.
This part is called the wiki2skill module.
The Elisit system
At the moment not fully functional...
Sample Queries
Popular Article about Natural Language Understanding
Blog Article about SEO Marketing
Wikipedia Article about Geo Information Systems
Scientific Article about Graph Mining
Try it...
Elisit Web Interface: http://elisit.cental.be/
Elisit Web Service: http://elisit.cental.be:8080/
This is only a demo: not optimized for multiple-user queries, high load, fast response, etc.
Association with Wikipedia
1. Find Wikipedia pages relevant to a target document.
We compute the similarity between the input document and all Wikipedia pages.
We tried four different models:
1 TF-IDF (300,000 dimensions)
2 LogEntropy (300,000 dimensions)
3 LogEntropy + LSA (200 dimensions)
4 LogEntropy + LDA (200 topics)
⇒ the target document is represented as a semantic vector of size n, the number of Wikipedia pages (inspired by ESA [Gabrilovich and Markovitch, 2007]).
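The idea of the semantic vector can be illustrated with a small pure-Python sketch of the TF-IDF variant. It is not the Gensim-based text2wiki code: the three-page corpus, the whitespace tokenizer and the function names are hypothetical stand-ins, and the weighting (raw term counts times log inverse document frequency) is an assumed textbook form.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (term -> weight) per document, plus the idf table."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
               for tokens in token_lists]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus standing in for the n Wikipedia pages.
wiki_pages = {
    "Machine learning": "learning algorithms build models from training data",
    "Search engine optimization": "ranking web pages in search engine results",
    "Graph theory": "graphs consist of nodes and edges",
}
titles = list(wiki_pages)
page_vectors, idf = tfidf_vectors([text.split() for text in wiki_pages.values()])

def semantic_vector(document):
    """Similarity of the document to every page: a vector of size n."""
    counts = Counter(document.split())
    doc_vec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    return [cosine(doc_vec, page_vec) for page_vec in page_vectors]

a0 = semantic_vector("training data for learning models")
best = titles[max(range(len(titles)), key=a0.__getitem__)]
print(best, [round(x, 3) for x in a0])
```

The resulting list plays the role of the initial activation vector: one similarity score per Wikipedia page.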
Spreading activation in Wikipedia
2. Use Wikipedia’s hyperlink network to find skills that are “close” or “central” to these relevant pages.
[Figure labels: INITIAL PAGES, SKILLS]
Formalization of spreading activation by Shrager et al. [1987]:
If a(0) is a vector of initial activations, then after each time step t, the vector of activations is
$a(t) = \gamma\, a(t-1) + \lambda\, W^{\top} a(t-1) + c(t)$
Parameters
T, the number of time steps
γ ∈ [0, 1] is a decay factor
λ ∈ [0, 1] is a friction factor
c(t) is an activation source vector
The link weight, element $w_{ij}$ of W, determines the amount of activation that is spread from i to j.
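The update rule can be sketched in a few lines of plain Python. This is a minimal illustration on a hypothetical three-node graph, not the SciPy-based wiki2skill implementation; the function name, the dictionary representation of W and the parameter names `decay` and `friction` are assumptions made for the example.

```python
def spread_activation(weights, a0, steps, decay=1.0, friction=1.0, source=None):
    """Iterate a(t) = decay * a(t-1) + friction * W^T a(t-1) + c(t).

    weights: dict mapping node i -> {node j: w_ij}
    a0:      dict mapping node -> initial activation
    source:  optional function t -> dict of external activation c(t)
    """
    activation = dict(a0)
    for t in range(1, steps + 1):
        # Decay term: each node keeps a fraction of its own activation.
        new_activation = {node: decay * value for node, value in activation.items()}
        # Spreading term: node i sends friction * w_ij * a_i to each neighbour j.
        for i, value in activation.items():
            for j, w_ij in weights.get(i, {}).items():
                new_activation[j] = new_activation.get(j, 0.0) + friction * w_ij * value
        # Optional external source c(t).
        if source is not None:
            for node, value in source(t).items():
                new_activation[node] = new_activation.get(node, 0.0) + value
        activation = new_activation
    return activation

# Hypothetical toy graph with row-stochastic weights.
W = {"A": {"B": 0.5, "C": 0.5}, "B": {"C": 1.0}, "C": {"A": 1.0}}
result = spread_activation(W, {"A": 1.0}, steps=2, decay=0.0, friction=1.0)
print(result)
```

With `decay=0` and no source, each step simply pushes all activation along the out-links, so on this toy graph the total activation stays at 1.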
$a(t) = \gamma\, a(t-1) + \lambda\, W^{\top} a(t-1) + c(t)$
Thorough model selection is difficult because of the size of the problem.
We experimented with three versions of the model:
model 1: $a(t) = W^{\top} a(t-1)$
model 2: $a(t) = W^{\top} a(t-1) + a(t-1)$
model 3: $a(t) = W^{\top} a(t-1) + a(0)$
In addition, W is constrained to be row-stochastic.
We focused more on the selection of the link weights than on the other parameters.
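Unrolling the three recursions makes the difference between them easier to see. In the short derivation below, $W^{\top}$ denotes the transpose of W and the exponent T is the number of time steps; the conservation remark at the end additionally assumes that every row of W sums to one, i.e. that no page lacks out-links.

```latex
\begin{align*}
\text{model 1:}\quad a(T) &= (W^{\top})^{T} a(0) \\
\text{model 2:}\quad a(T) &= (I + W^{\top})^{T} a(0)
                           = \sum_{k=0}^{T} \binom{T}{k} (W^{\top})^{k} a(0) \\
\text{model 3:}\quad a(T) &= \sum_{k=0}^{T} (W^{\top})^{k} a(0) \\[4pt]
\text{model 1, row-stochastic } W:\quad
\mathbf{1}^{\top} a(t) &= \mathbf{1}^{\top} W^{\top} a(t-1)
                        = (W \mathbf{1})^{\top} a(t-1)
                        = \mathbf{1}^{\top} a(t-1)
\end{align*}
```

Model 1 therefore only counts walks of length exactly T and conserves the total activation, whereas models 2 and 3 accumulate contributions from walks of every length from 0 to T, with model 2 weighting them by binomial coefficients.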
Observation from initial results:
Hubs get easily activated even if they are not relevant.
Common phenomenon with large graphs [Brand, 2005; von Luxburg et al., 2010]
Solution:
we bias the spreading to avoid hubs by
$w_{ij} = \dfrac{\pi_j^{\alpha}}{\sum_{(i,k) \in E} \pi_k^{\alpha}}$
$\pi_j$ is a popularity index of j (degree / PageRank / HITS).
If α = 0, no biasing; if α < 0, popular nodes are avoided.
Biased random walks have, e.g., shorter return times than unbiased random walks [Fronczak and Fronczak, 2009].
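A small sketch shows how the biased link weights behave on a toy graph. It uses in-degree as the popularity index; the graph, the node names and the helper functions are hypothetical and only meant to illustrate the formula, not to reproduce the Elisit code.

```python
def in_degree(out_links):
    """Count incoming links for every node that is the target of a link."""
    degree = {}
    for neighbours in out_links.values():
        for j in neighbours:
            degree[j] = degree.get(j, 0) + 1
    return degree

def biased_weights(out_links, popularity, alpha):
    """w_ij = pi_j**alpha / sum of pi_k**alpha over the out-neighbours k of i."""
    weights = {}
    for i, neighbours in out_links.items():
        scores = {j: popularity[j] ** alpha for j in neighbours}
        total = sum(scores.values())
        weights[i] = {j: score / total for j, score in scores.items()}
    return weights

# Hypothetical toy graph: "hub" has three incoming links, "niche" only one.
out_links = {
    "page": ["hub", "niche"],
    "other1": ["hub"],
    "other2": ["hub"],
    "hub": ["page"],
    "niche": ["page"],
}
pi = in_degree(out_links)

unbiased = biased_weights(out_links, pi, alpha=0.0)
biased = biased_weights(out_links, pi, alpha=-1.0)
print(unbiased["page"])
print(biased["page"])
```

With α = 0 the two out-links of `page` get equal weight; with α = -1 the weight towards the hub shrinks in favour of the less popular node, while every row still sums to one.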
Evaluation of system
We evaluated the biasing strategy by seeing how well the system activates related skills, defined by LinkedIn (at most 20 related skills per skill).
          Pre@5                  Pre@10                 R-Pre                  Rec@100
α         d_in   PR     HITS     d_in   PR     HITS     d_in   PR     HITS     d_in   PR     HITS
0         0.119  0.119  0.119    0.156  0.156  0.156    0.154  0.154  0.154    0.439  0.439  0.439
-0.2      0.206  0.238  0.206    0.222  0.216  0.213    0.172  0.193  0.185    0.469  0.469  0.494
-0.4      0.225  0.263  0.169    0.203  0.200  0.150    0.185  0.204  0.148    0.503  0.498  0.476
-0.6      0.238  0.225  0.119    0.200  0.197  0.141    0.186  0.193  0.119    0.511  0.517  0.418
-0.8      0.213  0.181  0.075    0.191  0.197  0.113    0.171  0.185  0.109    0.515  0.524  0.384
-1        0.169  0.156  0.063    0.178  0.197  0.091    0.154  0.172  0.097    0.493  0.518  0.336
Table: The effect of the biasing parameter α and the choice of popularity index (d_in: in-degree, PR: PageRank, HITS) on the results in the evaluation of the module.
E.g., the top 5 most activated skills of all the ≈ 27 000 skills contain 1-2 of the ≤ 20 related skills, on average.
Also, moderate biasing improves retrieval results: for every measure and every popularity index, the best score is obtained with some α < 0, although strongly negative α with HITS falls below the unbiased baseline.
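The slides do not spell out the evaluation measures, so the sketch below assumes the standard information-retrieval definitions of precision at k, R-precision and recall at k. The ranked list and the set of related skills are hypothetical toy data.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    return sum(1 for item in ranked[:k] if item in relevant) / len(relevant)

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant items."""
    return precision_at_k(ranked, relevant, len(relevant))

# Hypothetical example: skills ranked by final activation, and the
# related skills that should ideally be retrieved.
ranked_skills = ["python", "java", "sql", "hadoop", "nlp", "excel"]
related_skills = {"python", "nlp", "statistics"}

pre5 = precision_at_k(ranked_skills, related_skills, 5)
rec5 = recall_at_k(ranked_skills, related_skills, 5)
rpre = r_precision(ranked_skills, related_skills)
print(pre5, rec5, rpre)
```

Read this way, a Pre@5 of 0.263 corresponds to roughly 0.263 × 5 ≈ 1.3 related skills among the five most activated ones.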
We also ran a test for comparing the different language models.
VSM            Pre@5   Pre@10  R-Pre   Rec@100
TF-IDF         0.231   0.214   0.190   0.516
LogEntropy     0.216   0.212   0.193   0.525
LogEnt + LSA   0.180   0.181   0.163   0.491
LogEnt + LDA   0.193   0.174   0.159   0.470
Table: Comparison of the different vector space models in the performance of the whole system.
Conclusion
The Elisit system extracts explicit skills that are related to an arbitrary text input.
Combination of ESA-style conceptual mapping and spreading activation on the Wikipedia network
Evaluation experiments suggest that using popularity-biased spreading activation improves retrieval results.
Future work
Improvement of link weights, e.g. by
  computing content similarity of the Wikipedia pages
  trying other structural similarity measures
  using the category memberships of pages
Comparison with other strategies
More sophisticated (e.g. hierarchical) representation of results.
Also, the methodology could be applied for other purposes, e.g. a general topic model by replacing skills with topics.
References
Krisztian Balog, Toine Bogers, Leif Azzopardi, Maarten de Rijke, and Antal van den Bosch. Broad expertise retrieval in sparse data environments. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 551–558. ACM, 2007.
Krisztian Balog, Paul Thomas, Nick Craswell, Ian Soboroff, Peter Bailey, and Arjen P. de Vries. Overview of the TREC 2008 Enterprise Track. Technical report, DTIC Document, 2008.
Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, and Luo Si. Expertise retrieval. Foundations and Trends in Information Retrieval, 6(2-3):127–256, 2012.
Ernst Biesalski. Knowledge management and e-human resource management. FGWM 2003, 2003.
M. Brand. A random walks perspective on maximizing satisfaction and profit. Proceedings of the 2005 SIAM International Conference on Data Mining, 2005.
Dan Crow and John DeSanto. A hybrid approach to concept extraction and recognition-based matching in the domain of human resources. In Tools with Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE International Conference on, pages 535–541. IEEE, 2004.
Hongbo Deng, Irwin King, and Michael R. Lyu. Formal models for expert finding on DBLP bibliography data. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, pages 163–172. IEEE, 2008.
Agata Fronczak and Piotr Fronczak. Biased random walks in complex networks: The role of local navigation rules. Physical Review E, 80(1):016107, 2009.
Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI’07: Proceedings of the 20th International Joint Conference on Artificial Intelligence, pages 1606–1611, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
Radim Rehurek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.
Jeff Shrager, Tad Hogg, and Bernardo A. Huberman. Observation of phase transitions in spreading activation networks. Science, 236(4805):1092–1094, 1987.
U. von Luxburg, A. Radl, and M. Hein. Getting lost in space: large sample analysis of the commute distance. Proceedings of the 23rd Neural Information Processing Systems conference (NIPS 2010), pages 2622–2630, 2010.