Outline of this introduction - Lumière University Lyon...
Introduction to Text Mining and Natural Language Processing
Master Data Mining
Julien Velcin
Outline of this introduction
• Opportunities raised by the digital humanities
• Textual data are ubiquitous
• First definition and relation with data mining
Opportunities raised by the digital humanities
ERIC Lab and Digital Humanities
• Numerous new innovative works CS+LSSH
• National and international projects
• ThatCamp Paris (May 18th/19th 2010)
– Manifesto signed by 214 academic people
• Digital humanities:
– Linguistics, Social Sciences, Humanities
– Convergence of various communities
– Based on previous works and collaborations
– Deeply multi-disciplinary
• Huge issues for both teaching and research
Computer Science for interactional corpora
• CLAPI
– Corpus de Langues Parlées en Interaction (Corpus of Spoken Languages in Interaction)
– ICAR lab (C. Etienne, C. Plantin, L. Mondada…)
– Information system (IS) developed in collaboration with ERIC (F. Bentayeb, S. Loudcher…)
• Example: advertisers' work meeting
Computer Science for historical data
• SyMoGIH
– Système Modulaire de Gestion de l’Information Historique (Modular System for Historical Information Management)
– LARHRA, pôle méthode (F. Beretta, P. Vernus…)
– IS developed in collaboration with ERIC (J. Darmont, O. Boussaïd…)
• Example: the « cartes postales » (postcards) database
ID: 22
Title: Two little girls, full length, one carrying a basket
Support: Thin card
Size: Carte-de-visite photograph
Type: Black and white
Caption (verso): Ethel and Grace
Photographer(s): 1: Name: WADE G
Theme(s): Framing --> Full length
Gender and ages of life --> Children
Photographer ID: 10891
Name: WADE First name: G
Sex: Male Country: England
Technique: Dry plate
Main activity: Studio photographer
Stock: Yes
Activity start date: 1880
Computer Science and Journalism
• Collaborative work with researchers in communication and information science (JADN)
– Centre M. Weber (J.C. Soulage...), Univ. of Parana (Brazil)
• Urgent need to extract useful information from huge data repositories on the Web, in particular for developing modern search engines
• In particular:
– « How to surface the best comments, videos and pictures from a variety of sources in real time and then how to verify them? »
– « How to quickly surface the best comments and work out which ones are worth investigating further? »
– « How to identify quickly the key influencers on any particular story, so they can get inside information or interview them for their news outlets? »
Chronolines project (Nguyen et al., 2014)
Kiem-Hieu Nguyen, Xavier Tannier, Véronique Moriceau. Ranking Multidocument Event Descriptions for Building Thematic Timelines. In Proceedings of the 30th International Conference on Computational Linguistics (Coling 14). Dublin, Ireland, August 2014.
Cultural comparison
[Figure: normalized distribution of 15 categories (each category can be associated with multiple topics); panels: FR, US, BR]
Computer Science for Literature
• ANR project LIFRANUM (MARGE, ERIC, BnF)
• We are hiring!
Document co-citation analysis to enhance transdisciplinary research (Trujillo et al., 2018), taken from: https://advances.sciencemag.org/content/4/1/e1701130/tab-figures-data
Textual data are ubiquitous
News
Short texts
Objective + subjective
Weak structure
Blogs, forums, chats
Hidden for obvious reasons
Scientific articles
Patents
Speech to text
Credit: Frank Musiek
First definition and relation with data mining
One possible definition
Text mining is...
« finding interesting regularities in large textual datasets and using them for solving a specific task »
Not only textual data...
[Figure labels: Title, Source, Timestamp, Annotations, Outline]
Text Mining, NLP and more…
Here is a dialogue from the film 2001: A Space Odyssey: Hal the computer and Frank the astronaut are playing chess.
Frank: Anyway, queen takes pawn.
Hal: Bishop takes knight’s pawn.
Frank: Rook to King 1.
Hal: I’m sorry, Frank. I think you missed it. Queen to Bishop 3, bishop takes queen, knight takes bishop, mate.
Frank: Yeah, looks like you’re right. I resign.
Hal: Thank you for an enjoyable game.
Frank: Thank you.
Links to data mining
Challenges with complex data
• Representation of complex data
– attributes and pertinence
– multi-view indexing
– fixing the curse of dimensionality!
• Modalities fusion
– different modalities: plain text, images, annotations, etc.
– various and heterogeneous sources
• Semantic enrichment
– integrating domain knowledge (the so-called “ontologies”)
– information retrieval, machine learning, etc.
– from novel information to knowledge: the role of validation
Various disciplines involved
– Artificial Intelligence (AI)
– Statistics, data analysis
– Linguistics
– Information Retrieval (IR)
– Natural Language Processing (NLP)
– Computational linguistics
– Data mining
– Knowledge engineering and Semantic Web
– Machine learning (e.g., deep learning)
« Some » difficulties
• A very large vocabulary
– concepts and instances
– underlying relation between words (synonymy, antonymy, meronymy, metonymy etc.)
– semantic ambiguity: “He saw the boy with his glasses.” (who has the glasses?)
• Pairwise comparison
– Curse of dimensionality
• Different tasks => different ways for representing the textual content
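The vocabulary and pairwise-comparison difficulties listed above can be made concrete with a minimal sketch. It assumes a tiny made-up corpus and a naive whitespace tokenizer (both are illustrative choices, not course material). It shows that every distinct word adds one dimension to the document vectors, that most entries of those vectors are zero, and that the number of document pairs to compare grows quadratically with the number of documents.

```python
from itertools import combinations

# Toy corpus: illustrative sentences only, not course data.
docs = [
    "he saw the boy with his glasses",
    "the boy saw a bird",
    "text mining finds regularities in large textual datasets",
]

def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()

# Each distinct word becomes one dimension of the document vectors.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

def bag_of_words(text):
    """Count vector over the shared vocabulary; most entries are zero."""
    tokens = tokenize(text)
    return [tokens.count(word) for word in vocabulary]

vectors = [bag_of_words(doc) for doc in docs]

# Comparing every document with every other one costs n * (n - 1) / 2 pairs.
pairs = list(combinations(range(len(docs)), 2))

# Share of zero entries: a rough indicator of how sparse the vectors are.
zeros = sum(v.count(0) for v in vectors)
sparsity = zeros / (len(vectors) * len(vocabulary))

print(len(vocabulary), len(pairs), round(sparsity, 2))
```

Even with three short sentences, more than half of the vector entries are zero; on a real corpus the vocabulary reaches tens of thousands of words, which is why the choice of representation depends on the task.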
Course schedule
• General introduction and major applications
• Basics in text mining
– representing and comparing documents (inverted index, VSM, TFxIDF, cosine…)
– basic preprocessing (tokenization, stopwords, stemming…)
– beyond words (n-grams, collocations)
– some advanced notions of NLP (PoS tagging, chunking, WSD…)
• Machine learning for textual data
– supervised classification of documents
– Imagiweb: a case study in opinion mining
– unsupervised learning with topic learning (LSA, NMF, LDA)
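Several of the basics named in the schedule (tokenization, stopword removal, n-grams, TFxIDF weighting and cosine comparison) fit in a short sketch. The one below is only an illustration under stated assumptions: the stopword list, the mini-corpus and the function names are hypothetical, the weighting uses raw counts with log(N / df), which is one common variant among several, and stemming and the inverted index are left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to", "on"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def bigrams(tokens):
    """Adjacent token pairs, a simple step 'beyond words'."""
    return list(zip(tokens, tokens[1:]))

def tfidf_vectors(docs):
    """Map each document to a sparse {term: tf * idf} dictionary."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical mini-corpus for illustration.
docs = [
    "Text mining is the mining of textual data.",
    "Data mining finds regularities in large datasets.",
    "The astronaut and the computer are playing chess.",
]
vectors = tfidf_vectors(docs)
print(bigrams(preprocess(docs[2])))
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

Storing each vector as a dictionary keeps only the non-zero weights, which matters because a document uses a very small part of the vocabulary.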
What’s not in this course
• Syntactic / statistical / dependency parsing
• Neural approaches, such as RNN and LSTM (see the deep learning course)
• Word/document embedding techniques (see the representation learning course)
• Knowledge engineering and Semantic Web
• Visualization of textual data
Material on the Internet
• Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
• Speech and Language Processing (3rd ed., draft) by Dan Jurafsky and James H. Martin, 2018
https://web.stanford.edu/~jurafsky/slp3/
• NCSU MSA Program Course Module
http://research.csc.ncsu.edu/ase/courses/analytics/textmining/