Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth...

19
Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September 2017 Scrambling for Metadata

Transcript of Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth...

Page 1: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September

Using Topic Modeling and Word2Vec to explore the Archives of the EC

Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen

Brussels, September 2017

Scrambling for Metadata

Page 2: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 3: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 4: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 5: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September

Eurovoc

Page 6: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 7: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 8: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 9: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September

Corpus

• 24.787 pdf documents, representing 138,3 GB

• Period 1958 -1982, with documents in French, Dutch, German, Italian, Danish, English and Greek

• Tombstone metadata

• Conversion to .txt and split per language => 205.370 . txt files, representing 7,4 GB

• 835.717.292 words or 1.671.434 pages

Page 10: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 11: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 12: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 13: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September

Topic labeling ?

• “More of an art than a science” (Chang et al, 2009)

• Traditionally, two automated approaches :

• deriving labels from the top tokens of the topics

• label with concept from a controlled vocabulary

Page 14: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September

Topic labeling ?

• Hulpus et al (2013) & Allahyaria and Kochuta (2015) use the graph structure of DBPedia to rank the different label candidates

• But - graph structure of DBPedia as a knowledge structure is not terribly coherent …

• Our approach : use pre-trained Word2Vec to exclude in an iterative manner the “outliers” from the tokens of a topic and match the remaining token with Eurovoc

Page 15: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 16: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 17: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 18: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September
Page 19: Scrambling for Metadata · Using Topic Modeling and Word2Vec to explore the Archives of the EC Seth van Hooland, Mathias Coeckelbergs, Ettore Rizza and Simon Hengchen Brussels, September

Other scenario’s ?

• Use pre-trained Word2Vec to bring down the amount of terms per topic to for example 3

• Roll out Word2Vec trained on the EC corpus in order to identify the term most closely associated with those 3

• But : number of parameters to configure / play with exponentially increases the overhead of manual evaluation