Outline of this introduction - Lumière University Lyon...

Introduction to Text Mining and Natural Language Processing

Master Data Mining

Julien Velcin

Outline of this introduction

• Opportunities raised by the digital humanities

• Textual data are ubiquitous

• First definition and relation with data mining





ERIC Lab and Digital Humanities

• Numerous innovative new works combining CS and LSSH

• National and international projects

• ThatCamp Paris (May 18th/19th 2010)
– Manifesto signed by 214 academics

• Digital humanities:
– Linguistics, Social Sciences, Humanities
– Convergence of various communities
– Based on previous works and collaborations
– Deeply multi-disciplinary

• Huge issues for both teaching and research


Computer Science for interactional corpora

• CLAPI
– Corpus de Langues Parlées en Interaction (Corpus of Spoken Languages in Interaction)
– ICAR lab (C. Etienne, C. Plantin, L. Mondada…)
– Information system developed in collaboration with ERIC (F. Bentayeb, S. Loudcher…)

• Example: an advertisers' work meeting


Computer Science for historical data

• SyMoGIH
– Système Modulaire de Gestion de l’Information Historique (Modular System for Managing Historical Information)
– LARHRA, methods unit (F. Beretta, P. Vernus…)
– Information system developed in collaboration with ERIC (J. Darmont, O. Boussaïd…)

• Example: the « cartes postales » (postcards) database

Postcard record

ID: 22
Title: Two little girls, full length, one carrying a basket
Medium: Thin cardboard
Size: Carte-de-visite photograph
Type: Black and white
Caption (back): Ethel and Grace
Photographer(s): 1: Name: WADE G
Theme(s): Framing --> Full length; Gender and stages of life --> Children

Photographer record

Photographer ID: 10891
Last name: WADE
First name: G
Sex: Male
Country: England
Technique: Dry plate
Main activity: Studio photographer
Stock: Yes
Activity start date: 1880


Computer Science and Journalism

• Collaborative work with researchers in communication and information science (JADN)
– Centre M. Weber (J.C. Soulage...), Univ. of Parana (Brazil)

• Urgent need to extract useful information from huge data repositories on the Web, in particular for developing modern search engines

• In particular:
– « How to surface the best comments, videos and pictures from a variety of sources in real time and then how to verify them? »
– « How to quickly surface the best comments and work out which ones are worth investigating further? »
– « How to identify quickly the key influencers on any particular story, so they can get inside information or interview them for their news outlets? »

Chronolines project (Nguyen et al., 2014)

Kiem-Hieu Nguyen, Xavier Tannier, Véronique Moriceau. Ranking Multidocument Event Descriptions for Building Thematic Timelines. In Proceedings of the 25th International Conference on Computational Linguistics (Coling 14). Dublin, Ireland, August 2014.

Cultural comparison

[Figure: normalized distribution of 15 categories (each category can be associated with multiple topics); panels: FR, US, BR]

Computer Science for Literature

• ANR project LIFRANUM (MARGE, ERIC, BnF)

• We hire!

Document co-citation analysis to enhance transdisciplinary research (Trujillo et al., 2018), taken from: https://advances.sciencemag.org/content/4/1/e1701130/tab-figures-data







News

Short texts

Objective + subjective

Weak structure


Blogs, forums, chats


Hidden for obvious reasons


Scientific articles


Patents


Speech to text

Credit: Frank Musiek






One possible definition

Text mining is...

« finding interesting regularities in large textual datasets and using them for solving a specific task »


Not only textual data...

[Figure labels: Title, Source, Timestamp, Annotations, Outline]


Text Mining, NLP and more…

Here is a dialogue from the film 2001: A Space Odyssey: Hal the computer and Frank the astronaut are playing chess.

Frank: Anyway, queen takes pawn.
Hal: Bishop takes knight’s pawn.
Frank: Rook to King 1.
Hal: I’m sorry, Frank. I think you missed it. Queen to Bishop 3, bishop takes queen, knight takes bishop, mate.
Frank: Yeah, looks like you’re right. I resign.
Hal: Thank you for an enjoyable game.
Frank: Thank you.


Links to data mining


Challenges with complex data

• Representation of complex data
– attributes and relevance
– multi-view indexing
– fixing the curse of dimensionality!

• Modality fusion
– different modalities: plain text, images, annotations, etc.
– various and heterogeneous sources

• Semantic enrichment
– integrating domain knowledge (the so-called “ontologies”)
– information retrieval, machine learning, etc.
– from novel information to knowledge: the role of validation


Various disciplines involved

– Artificial Intelligence (AI)
– Statistics, data analysis
– Linguistics
– Information Retrieval (IR)
– Natural Language Processing (NLP)
– Computational linguistics
– Data mining
– Knowledge engineering and Semantic Web
– Machine learning (e.g., deep learning)


« Some » difficulties

• A very large vocabulary
– concepts and instances
– underlying relations between words (synonymy, antonymy, meronymy, metonymy, etc.)
– semantic ambiguity: “He saw the boy with his glasses.” (who has the glasses?)

• Pairwise comparison
– Curse of dimensionality

• Different tasks => different ways of representing the textual content
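The link between pairwise comparison and the curse of dimensionality can be illustrated with a small experiment. The sketch below is not taken from the course material: it assumes uniformly random points as a stand-in for document vectors, and the helper name `relative_contrast` is hypothetical. It measures how far apart the closest and the farthest pairs are, relative to the closest pair, as the number of dimensions grows.

```python
import math
import random


def relative_contrast(n_points, dim, rng):
    """Return (max - min) / min over all pairwise Euclidean distances.

    Points are drawn uniformly in the unit hypercube; this is only a
    stand-in (an assumption) for real document vectors.
    """
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.dist(points[i], points[j])
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)  # fixed seed so the run is repeatable
contrasts = {dim: relative_contrast(50, dim, rng) for dim in (2, 10, 100, 1000)}

for dim, value in contrasts.items():
    print(f"dim={dim:5d}  relative contrast={value:.2f}")
```

In such a run the contrast typically shrinks as the dimension grows: all pairs end up at roughly the same distance, which makes nearest-neighbour style comparisons less informative. With vocabularies of tens of thousands of words, document vectors are exposed to the same effect.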


Course schedule

• General introduction and major applications

• Basics in text mining
– representing and comparing documents (inverted index, VSM, TF×IDF, cosine…)
– basic preprocessing (tokenization, stopwords, stemming…)
– beyond words (n-grams, collocations)
– some advanced notions of NLP (PoS tagging, chunking, WSD…)

• Machine learning for textual data
– supervised classification of documents
– Imagiweb: a case study in opinion mining
– unsupervised learning with topic models (LSA, NMF, LDA)
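As a preview of the "representing and comparing documents" item, here is a minimal sketch of an inverted index, TF×IDF weighting and cosine similarity. It is not the course's own implementation: the stopword list, the three toy documents and the function names are assumptions made for illustration, and the weighting used is the plain variant tf × log(N / df).

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "on", "to"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens only, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in set(tokenize(text)):
            index.setdefault(term, set()).add(doc_id)
    return index


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (tf * log(N / df))."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    return [
        {t: c * math.log(n_docs / df[t]) for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus, invented for this example.
docs = [
    "The cat sat on the mat",
    "A cat and a dog sat on the rug",
    "Stock markets fell sharply today",
]

index = inverted_index(docs)
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # two documents sharing "cat" and "sat"
sim_02 = cosine(vecs[0], vecs[2])  # two documents with no term in common
```

The first two documents share terms and get a positive similarity, while the first and third share none and get zero. Stemming and n-grams are left out to keep the sketch short.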


What’s not in this course

• Syntactic / statistical / dependency parsing

• Neural approaches, such as RNN and LSTM (see the course of deep learning)

• Word/document embedding techniques (see the course of representation learning)

• Knowledge engineering and semantic Web

• Visualization of textual data


Material on the Internet

• Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press, 2008
http://nlp.stanford.edu/IR-book/information-retrieval-book.html

• Speech and Language Processing (3rd ed., draft) by Dan Jurafsky and James H. Martin, 2018
https://web.stanford.edu/~jurafsky/slp3/

• NCSU MSA Program Course Module
http://research.csc.ncsu.edu/ase/courses/analytics/textmining/
