Timo Honkela: Research interests in text and metadata mining of literature
Timo Honkela, A Seminar on Digital Publishing and Research. 3 Dec 2015
Timo Honkela
3 Dec 2015
Research interests in text and metadata mining of literature
A Seminar on Digital Publishing and Research
Inspirational context: classical literature
The presentation mostly covers general methods. We can discuss how and to what extent these can be used in this context.
Humanities and “compunities”
● Understanding and analysis of cultural artefacts such as novels or poems requires human experience of the world and dedication to the relevant context
● Computers are useful as tireless “distant reading tools” that can, for example, be put to count instances of linguistic expressions, their contextual relations and relation to given categories or to parse the structure of writings at different levels of abstraction
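As a minimal sketch of the kind of counting described above, the following Python snippet counts the occurrences of one word and tallies the words that appear near it. The sample sentence, the function names and the window size are illustrative assumptions, not part of the presentation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters as word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def count_with_context(tokens, target, window=2):
    """Count occurrences of `target` and the words within `window` positions of it."""
    hits = 0
    neighbours = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        hits += 1
        start = max(0, i - window)
        # Words before and after the hit, excluding the hit itself.
        neighbours.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return hits, neighbours


# Invented sample text, used only to exercise the functions.
sample = ("The king rode into the forest. The old king was tired, "
          "and the forest was dark.")
tokens = tokenize(sample)
hits, neighbours = count_with_context(tokens, "king")
print(hits, neighbours.most_common(3))
```

The window here ignores sentence boundaries; a real study would probably want to respect them and to apply the same counting across a whole collection.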
Digital humanities
● A research area called Digital Humanities usually combines (1) research questions stemming from the humanities and social sciences, (2) their data represented in digital form, and (3) quantitative computational analysis methods (statistics, machine learning) applied to the data alongside qualitative research methods
● For the computational analysis, large collections of data (“big data”) can provide certain benefits, but small data sets can also be analyzed in a similar manner
Text and data mining
● Data mining refers to the computer-based analysis of data collections in order to find interesting or useful patterns, relations or structures in the data
● Data mining is often applied to numerical data, but structured data can also be used
● Text mining refers to data mining applied specifically to texts
Text mining
● Analysis of text documents at different levels of abstraction
– Word segments (morphology)
– Lexicon (words, terms, phrases, names, etc.)
– Syntax
– Semantics
– Pragmatics (computationally challenging)
● Context!
Classical example: Learning meaning from context:
Maps of words in Grimm fairy tales
Honkela, Pulkki & Kohonen 1995
Automated learning of word relations using a self-organizing map on text context data
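The idea of learning word relations from context can be illustrated with a much simpler stand-in. The sketch below builds raw co-occurrence context vectors and compares them with cosine similarity. It does not implement the self-organizing map and is not the method of the cited study; the toy sentences, function names and window size are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=1):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            for neighbour in tokens[start:i] + tokens[i + 1:i + 1 + window]:
                vectors[word][neighbour] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented toy corpus: "king" and "queen" occur in the same contexts.
toy_corpus = [
    "the king ate bread",
    "the queen ate bread",
    "the king drank wine",
    "the queen drank wine",
]
vectors = context_vectors(toy_corpus)
sim_king_queen = cosine(vectors["king"], vectors["queen"])
sim_king_bread = cosine(vectors["king"], vectors["bread"])
print(sim_king_queen, sim_king_bread)
```

Words that share contexts end up with similar vectors, so "king" is closer to "queen" than to "bread". A self-organizing map would then arrange such vectors on a two-dimensional grid so that similar words land near each other.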
Map of Finnish Science
Chemistry
Physics and engineering
Biosciences
Medicine
Culture and society
A fully automated process from terminology extraction (Likey) to semantic space construction (SOM) without any manually constructed resources.
Further opportunities
● Time series analysis (historical developments)
– Ordering a collection of texts
– Analysis of narrative structures
● Social Network Analysis
● Sentiment analysis
● Named Entity Recognition (people, places, organisations)
Analysis of metadata
● Study of the overall structures in a collection
● Important sources include author, year of publication, place, etc.
● Can be analyzed in itself or in combination with the full text data
● Study of the quality of the metadata
– Variation
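One way to look at metadata quality and variation, sketched here under stated assumptions: count missing fields, and group values that differ only in letter case or whitespace. The records, field names and normalisation rule below are hypothetical and were invented for this example.

```python
from collections import Counter, defaultdict

# Hypothetical catalogue records, invented for illustration only.
records = [
    {"author": "Example, Anna", "year": "1900", "place": "Helsinki"},
    {"author": "example, anna ", "year": "1900", "place": "Helsinki"},
    {"author": "Sample, Bert", "year": "", "place": "Helsinki"},
    {"author": "Example, Anna", "year": "1905", "place": None},
]


def normalise(value):
    """Lowercase a value and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def missing_counts(records, fields):
    """Count, per field, the records where the field is absent or empty."""
    return Counter(
        field for record in records for field in fields if not record.get(field)
    )


def spelling_variants(records, field):
    """Return normalised values that appear in more than one raw spelling."""
    groups = defaultdict(set)
    for record in records:
        value = record.get(field)
        if value:
            groups[normalise(value)].add(value)
    return {key: raw for key, raw in groups.items() if len(raw) > 1}


missing = missing_counts(records, ["author", "year", "place"])
variants = spelling_variants(records, "author")
print(missing)
print(variants)
```

The normalisation is deliberately crude. Real collections usually also need handling for initials, transliteration and name order, which this sketch does not attempt.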
Hybrid methods: qualitative + quantitative
● Qualitative interpretation of quantitative analysis results
● Quantitative analysis of qualitative interpretations
● Parallel analysis with qualitative methods (e.g. grounded theory) and quantitative methods (e.g. unsupervised learning)
● Quantitative analysis of human subjective and contextual understanding and expression
Grounded Intersubjective Concept Analysis (GICA)
● A method developed to model how language is understood in context and with some degree of individuality
● Computational approaches often assume a shared epistemology; here we are interested in the differences in human interpretation
GICA analysis of the word “health” in State of the Union Addresses
Thank you very much!