John P. McCrae
Insight Centre for Data Analytics
National University of Ireland Galway
Topic extraction, expert finding and trend analysis from scientific literature
Knowledge Extraction from Text
- with Saffron -
retrieving awareness
of someone something information
descriptions skills
addition of metadata
Original Use Case: Expert Finding
Architecture
Step 1 - Corpus Indexing
...
Step 2 - Domain Modelling
…concepts such as Machine Translation…
…Noun phrases and other elements…
Trigger Words
Term
Term
Step 3 - Topic (term) Extraction
| Token | concepts | such | as | Machine | Translation |
| --- | --- | --- | --- | --- | --- |
| POS tag | NNS | JJ | IN | NNP | NNP |
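A minimal sketch of how candidates could be pulled out of a tagged sentence like the one above. The slide does not state the actual pattern, so the rule used here (maximal runs of adjectives and nouns, trimmed to end in a noun) and the names `NOUN_TAGS`, `ADJ_TAGS` and `extract_candidates` are assumptions for illustration only.

```python
from typing import List, Tuple

# Assumed tag sets: adjectives and nouns may appear inside a candidate,
# but a candidate has to end in a noun.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ"}


def extract_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    """Return maximal adjective/noun runs, trimmed so each ends in a noun."""
    candidates: List[str] = []
    run: List[Tuple[str, str]] = []
    # The sentinel entry flushes the final run.
    for token, tag in tagged + [("", "END")]:
        if tag in NOUN_TAGS or tag in ADJ_TAGS:
            run.append((token, tag))
            continue
        # Drop trailing adjectives, e.g. "such" in "concepts such as".
        while run and run[-1][1] not in NOUN_TAGS:
            run.pop()
        if run:
            candidates.append(" ".join(tok for tok, _ in run))
        run = []
    return candidates


tagged_sentence = [
    ("concepts", "NNS"), ("such", "JJ"), ("as", "IN"),
    ("Machine", "NNP"), ("Translation", "NNP"),
]
print(extract_candidates(tagged_sentence))
```

On the example sentence this yields the two candidates that are scored on the slide, "concepts" and "Machine Translation".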
| Candidate | Weirdness | Relevance | Domain Pertinence | ··· |
| --- | --- | --- | --- | --- |
| Concepts | 0.1 | 0.6 | 0.8 | ··· |
| Machine Translation | 0.8 | 0.7 | 0.7 | ··· |
Candidate selection by voting
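The slide does not say which voting scheme is used. As a rough sketch, assume each metric ranks the candidates and casts a reciprocal-rank vote (1 for first place, 1/2 for second, and so on), and that the candidates with the most votes are kept. The function name `vote` and the `top_n` parameter are hypothetical; the scores are the three visible columns of the table.

```python
from typing import Dict, List

# Scores copied from the table above (only the three visible metrics).
scores: Dict[str, Dict[str, float]] = {
    "Concepts": {"weirdness": 0.1, "relevance": 0.6, "domain_pertinence": 0.8},
    "Machine Translation": {"weirdness": 0.8, "relevance": 0.7, "domain_pertinence": 0.7},
}


def vote(scores: Dict[str, Dict[str, float]], top_n: int) -> List[str]:
    """Assumed scheme: every metric ranks all candidates and gives 1/rank votes."""
    metrics = sorted({m for per_candidate in scores.values() for m in per_candidate})
    votes = {candidate: 0.0 for candidate in scores}
    for metric in metrics:
        ranked = sorted(scores, key=lambda c: scores[c][metric], reverse=True)
        for rank, candidate in enumerate(ranked, start=1):
            votes[candidate] += 1.0 / rank
    # Keep the candidates with the highest total vote.
    return sorted(votes, key=lambda c: votes[c], reverse=True)[:top_n]


print(vote(scores, top_n=1))
```

With these numbers "Machine Translation" ranks first on two of the three metrics, so it wins the vote even though "Concepts" has the higher Domain Pertinence.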
Term Extraction – ACL Anthology
Step 4 - Author Consolidation
John McCrae
John P. McCrae
McCrae, J.P.
{ "honorific": null, "givenName": "John", "middleInitial": "P", "familyName": "McCrae" }
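A minimal sketch of how the three spellings above might be parsed into the structured record and then matched. The parsing rules and the functions `parse_name` and `same_author` are assumptions, not Saffron's actual consolidation logic; honorifics are not detected and are always left as `None`.

```python
import re
from typing import Dict, Optional


def parse_name(raw: str) -> Dict[str, Optional[str]]:
    """Parse 'Given [M.] Family' or 'Family, G.[M.]' into the record's fields."""
    raw = raw.strip()
    if "," in raw:
        family, rest = (part.strip() for part in raw.split(",", 1))
    else:
        *rest_parts, family = raw.split()
        rest = " ".join(rest_parts)
    # Split "J.P." or "John P." on dots and whitespace.
    parts = [p for p in re.split(r"[.\s]+", rest) if p]
    return {
        "honorific": None,  # not handled in this sketch
        "givenName": parts[0] if parts else None,
        "middleInitial": parts[1][0] if len(parts) > 1 else None,
        "familyName": family,
    }


def same_author(a: Dict[str, Optional[str]], b: Dict[str, Optional[str]]) -> bool:
    """Assumed rule: family names match and the other fields do not conflict."""
    if a["familyName"].lower() != b["familyName"].lower():
        return False
    ma, mb = a["middleInitial"], b["middleInitial"]
    if ma and mb and ma.lower() != mb.lower():
        return False
    ga, gb = a["givenName"], b["givenName"]
    if not ga or not gb:
        return True
    if len(ga) > 1 and len(gb) > 1:
        return ga.lower() == gb.lower()
    # At least one side is only an initial: compare first letters.
    return ga[0].lower() == gb[0].lower()


variants = ["John McCrae", "John P. McCrae", "McCrae, J.P."]
records = [parse_name(v) for v in variants]
print(records[1])
print(all(same_author(records[1], r) for r in records))
```

A rule this permissive will also merge different people who share a family name and first initial, so a real system would need further evidence such as affiliations or co-authors.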
Step 5 - DBpedia Lookup
http://dbpedia.org/resource/Machine_translation
“Machine Translation”
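As a rough illustration of the mapping shown above, the sketch below guesses a DBpedia resource URI from a term by sentence-casing it and replacing spaces with underscores. This is only a naming heuristic and an assumption: `guess_dbpedia_uri` is a hypothetical helper, it makes no network call, and it does not check that the resource exists. Lowercasing will also break proper nouns and acronyms, so a real lookup would have to query DBpedia itself.

```python
from urllib.parse import quote

DBPEDIA_RESOURCE = "http://dbpedia.org/resource/"


def guess_dbpedia_uri(term: str) -> str:
    """Heuristic only: sentence-case the term and join words with underscores."""
    words = term.strip().split()
    if not words:
        raise ValueError("empty term")
    label = " ".join(words).lower()
    label = label[0].upper() + label[1:]
    # Percent-encode anything that is not safe in a URI path segment.
    return DBPEDIA_RESOURCE + quote(label.replace(" ", "_"))


print(guess_dbpedia_uri("Machine Translation"))
```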
Step 6 - Topic Statistics
Topic Generality
Weaknesses:
● Favours common terms
● Denormalized PMI?
⇒ Multi-factor metric
Georgeta Bordea (2013) Domain adaptive extraction of topical hierarchies for Expertise Mining. PhD Thesis, National University of Ireland, Galway
Step 7 - Connect Authors
Topic 1: TF-IAF: 0.3
Topic 2: TF-IAF: 0.7
Topic 3: TF-IAF: 0.5
Step 8 - Author Similarity
Cosine similarity of two TF-IRF vectors:

| TF-IRF | TF-IRF |
| --- | --- |
| .6 | .3 |
| .7 | .2 |
| .5 | .8 |
| ... | ... |
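A short sketch of the cosine computation on the two vectors shown on the slide. Both vectors are truncated there ("..."), so only the three visible components are used; the variable names `author_a` and `author_b` are placeholders.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length weight vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


# Only the visible components of the two vectors on the slide.
author_a = [0.6, 0.7, 0.5]
author_b = [0.3, 0.2, 0.8]
print(round(cosine(author_a, author_b), 3))
```

Because cosine ignores vector length, an author with many publications and one with few can still come out as similar if their weights are spread over the same topics.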
Step 9 - Topic Similarity
Cosine similarity of two Topic Score vectors:

| Topic 1 (Topic Score) | Topic 2 (Topic Score) |
| --- | --- |
| .6 | .3 |
| .7 | .2 |
| .5 | .8 |
| ... | ... |
Expertise Mining
Expertise Mining
Step 10 - Taxonomy Construction
● Reduce topic-topic graph to directed acyclic graph
  ○ Simpler hierarchical structure for corpus
● Minimum spanning tree
● Directed to ensure most general nodes are at the top
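The bullets above can be sketched roughly as follows, under several assumptions: the topic-topic graph is weighted by similarity, the spanning tree is computed with Kruskal's algorithm on distance = 1 - similarity, and the tree is then rooted at the topic with the highest generality score and directed away from it. The topics, similarities and generality values are made-up illustration data, and rooting at the most general topic only guarantees that this one topic ends up at the top.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def spanning_tree(nodes: List[str], sims: Dict[Edge, float]) -> List[Edge]:
    """Kruskal on distance = 1 - similarity, so the most similar links are kept."""
    parent = {n: n for n in nodes}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree: List[Edge] = []
    for (a, b), _sim in sorted(sims.items(), key=lambda kv: 1.0 - kv[1]):
        ra, rb = find(a), find(b)
        if ra != rb:  # adding this edge does not create a cycle
            parent[ra] = rb
            tree.append((a, b))
    return tree


def orient(tree: List[Edge], root: str) -> Dict[str, List[str]]:
    """Direct every tree edge away from the root (breadth-first)."""
    adjacent = defaultdict(list)
    for a, b in tree:
        adjacent[a].append(b)
        adjacent[b].append(a)
    children: Dict[str, List[str]] = defaultdict(list)
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in adjacent[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                children[node].append(neighbour)
                queue.append(neighbour)
    return dict(children)


# Hypothetical illustration data, not taken from the ACL Anthology results.
generality = {"nlp": 0.9, "machine translation": 0.6, "statistical mt": 0.3, "parsing": 0.5}
similarity = {
    ("nlp", "machine translation"): 0.8,
    ("machine translation", "statistical mt"): 0.9,
    ("nlp", "parsing"): 0.7,
    ("nlp", "statistical mt"): 0.5,
    ("machine translation", "parsing"): 0.3,
}
tree = spanning_tree(list(generality), similarity)
taxonomy = orient(tree, root=max(generality, key=lambda t: generality[t]))
print(taxonomy)
```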
Terms to Taxonomy - ACL Anthology
Taxonomy Extraction – ACL Anthology
Heterogeneous graph
Metadata links
Term Extraction
Topic Similarity
Author Similarity
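One way to picture the heterogeneous graph is as typed nodes joined by the four edge kinds listed above. The sketch below assumes three node types (document, author, topic) and assumes that metadata links join documents to authors while term extraction joins documents to topics; the slide does not spell this out. `HeteroGraph`, `experts_for` and all node names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Node = Tuple[str, str]  # (node type, identifier)


class HeteroGraph:
    """Undirected graph whose edges carry a kind and a weight."""

    def __init__(self) -> None:
        self.edges: Dict[Node, List[Tuple[str, Node, float]]] = defaultdict(list)

    def add_edge(self, src: Node, dst: Node, kind: str, weight: float = 1.0) -> None:
        self.edges[src].append((kind, dst, weight))
        self.edges[dst].append((kind, src, weight))

    def neighbours(self, node: Node, kind: Optional[str] = None) -> List[Node]:
        return [dst for k, dst, _ in self.edges.get(node, [])
                if kind is None or k == kind]


def experts_for(graph: HeteroGraph, topic: str) -> List[str]:
    """Authors of the documents from which the topic was extracted."""
    authors: List[str] = []
    for doc in graph.neighbours(("topic", topic), "term_extraction"):
        for _node_type, name in graph.neighbours(doc, "metadata"):
            if name not in authors:
                authors.append(name)
    return authors


g = HeteroGraph()
g.add_edge(("document", "doc-1"), ("author", "Author A"), "metadata")
g.add_edge(("document", "doc-1"), ("topic", "Machine Translation"), "term_extraction")
g.add_edge(("document", "doc-2"), ("author", "Author B"), "metadata")
g.add_edge(("document", "doc-2"), ("topic", "Parsing"), "term_extraction")
g.add_edge(("topic", "Machine Translation"), ("topic", "Parsing"), "topic_similarity", 0.4)
g.add_edge(("author", "Author A"), ("author", "Author B"), "author_similarity", 0.6)
print(experts_for(g, "Machine Translation"))
```

Keeping the edge kind on every link lets the same structure answer different questions, for example walking topic to document to author for expert finding, or following only similarity edges to find related topics.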
Industry Applications
Content Analysis for Book Recommendation
Semantic Search on Digital News Archives
Journalist Expertise
Towards Saffron 3
• Saffron was developed primarily by Georgeta Bordea, Barry Coughlan (and many others)
• Technical improvements
• One language (Java), one database (Lucene), one build system (Maven) etc.
• Refactor code with existing libraries
o V2.0: 14,500 Java LoC, 35,919 Python LoC
o V3.0: 7,000 Java LoC
Towards Saffron 3
• Saffron has attracted a lot of research and commercial attention
• But, Saffron is more importantly a research project.
• Next Step: Establish new baseline for
o Term Extraction
• Based on Astrakhantsev 2017
o Taxonomy Learning
• Use TExEval datasets (WordNet, EuroVoc)
• New datasets that are taxonomic, not hypernymic (e.g., ACM Computing Classification System).
N. Astrakhantsev. ATR4S: Toolkit with State-of-the-art Automatic Terms Recognition Methods in Scala. https://arxiv.org/abs/1611.07804
TExEval @ SemEval 2016: http://alt.qcri.org/semeval2016/task13/
• Then: New algorithms :)
Conclusion
• Big document collections are hard to understand
  • In Academia
  • In Industry
• Taxonomies are the natural way to explore datasets
• Evaluating the quality of a taxonomy is very hard
• Author metadata for documents lets us understand and find experts
• Heterogeneous graphs give new options for exploring document collections