Trausan-Matu: Natural Language Processing and Topic Modelling
Transcript of Trausan-Matu: Natural Language Processing and Topic Modelling
![Page 1: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/1.jpg)
Natural Language Processing andTopic Modelling
Ştefan Trăușan-Matu
University Politehnica of BucharestRomanian Academy Research Institute for Artificial Intelligence
COST Action IS1310 - Reassembling the Republic of Letters23rd March St’. Anne’s College University of Oxford
![Page 2: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/2.jpg)
Natural Language Processing (NLP)
• Input:
– Text in digital format (strings of characters)
• document
• corpus
• question
• transcription of a monologue or of a conversation
• instant messenger log
• discussion forum, social network
• corpus of interlinked documents (e.g. letters)
• dialog
2Republic of Letters, University of Oxford23 March 2015
![Page 3: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/3.jpg)
Natural Language Processing• Output:
– Text(s) in digital format
• translation – e.g. Google translate
• document(s) summary - summarizers
• answer – question answering
• clusters of documents
– Automatically generated annotations
– List of topics in the text
– Links among topics
– Similar documents
– Links among documents – intertextuality
– Threads of discussion, COLLABORATION
– Other data
• collocations
• structures (syntactical, discourse, rhetorical, etc.)
• opinions In text
• participation & collaboration degrees in conversations
• …3
![Page 4: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/4.jpg)
NLP approaches
• Grammar-based
• Statistical (corpus-based, machinelearning)
– unsupervized (clustering, LSA, LDA)
– Supervized
annotated corpus
learned model
automated annotation
4Republic of Letters, University of Oxford23 March 2015
![Page 5: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/5.jpg)
Text annotation• Space
• Time
• Named Entities
• Links
• Syntactic
• Semantic
• Pragmatic
• Discourse
• Rhetoric
• …5Republic of Letters, University of Oxford23 March 2015
![Page 6: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/6.jpg)
Text Annotators
6Republic of Letters, University of Oxford23 March 2015
![Page 7: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/7.jpg)
Topic Modeling No generally accepted definition for a “topic” in NLPDocument clustersAbstractions based on document clustersLabels;Centroids, etc
(Word, Probability) pairs
Bayesian statistical modelsTopics – distributions over wordsDocuments – distributions over topicsGenerative modelTopic IntertwiningConceptually similar to the ideas of Mikhail BakhtinTopics and voices
7Republic of Letters, University of Oxford23 March 2015
![Page 8: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/8.jpg)
Topic Modeling (2)• LSA/pLSA/hLDA/CTM
– Each newer version corrects some flaws of theearlier ones
• LDA
– Readily available
• Mallet
• Easily reproducible experiments
8Republic of Letters, University of Oxford23 March 2015
![Page 9: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/9.jpg)
The LSA idea
• Reducing the dimensionality of the vector space,similarly to the least squares method
• The effect is the creation of semantic spacescontaining semantically related words
• Bag-of-words approach
• http://lsa.colorado.edu
9Republic of Letters, University of Oxford23 March 2015
LSA - Vector space model
![Page 10: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/10.jpg)
Singular value decomposition (SVD)
n=min(t,d)Tdxnnxntxntxd DSTA
10Republic of Letters, University of Oxford23 March 2015
09.016.061.073.025.05dim
58.058.000.000.058.04dim
41.015.037.059.057.03dim
65.035.051.033.030.02dim
26.070.048.013.044.01dim
cos truckcarmoonastronautmonaut
T T
39.000.000.000.000.0
00.000.100.000.000.0
00.000.028.100.000.0
00.000.000.059.100.0
00.000.000.000.016.2
S
22.041.019.063.029.053.05dim
58.058.000.058.000.000.04dim
33.012.020.045.075.028.03dim
41.022.063.019.053.029.02dim
12.033.045.020.028.075.01dim
654321 dddddd
DT
101000
011001
000011
000010
000101cos
654321
truck
car
moon
astronaut
monaut
dddddd
ATerms-documents array
(ex. from Manning and Schutze, 1999)
Reduced A
• By SVD on maps the n-dimension space ona k-dimension one, with n >>k
• Common values for k are 100 and 150.
2|||| AA
Tdxx DSB 222
65.035.000.130.084.046.02dim
26.071.097.004.060.062.11dim
654321 dddddd
B
![Page 11: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/11.jpg)
LSA based text processing
The most significant 20 wordsfrom Plato
[Plato|TheApology,Justin|TheSecondApology-(0.6475);Plato|TheRepublic.7,Irenaeus|AgainstHeresies.6-(0.6095)]
The similarity of Plato’s workswith the works of other writers
11Republic of Letters, University of Oxford23 March 2015
![Page 12: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/12.jpg)
Latent Dirichlet Allocation
12
http://www.columbia.edu/~ih2240/dataviz/G4063-week5/images/text/LDA.png
Republic of Letters, University of Oxford23 March 2015
![Page 13: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/13.jpg)
Bakhtin’s Polyphonic Intertextuality
Voice I
Voice IIVoice III
Voice IVoice IIVoice III
In dialog
Text 1 Text 2 Text 3
Text 4
Text 1Text 2Text 3
In dialog in text 4
13Republic of Letters, University of Oxford23 March 2015
![Page 14: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/14.jpg)
Polyphony Appears in music (e.g. J.S.Bach) and in novels (Bakhtin)
The Polyphonic Model (Trausan-Matu, 2005, 2010)
Analysis method (Trausan-Matu, Dascalu and Rebedea, 2010)
Computer support tools for the polyphonic analysis ofconversations and networks of documents The “Polyphony” system (Trausan-Matu and all, 2007)
ASAP (Dascalu, Chioasca and Trausan-Matu, 2008)
PolyCAFe (Trausan-Matu, Rebedea and Dascalu, 2011; Rebedea, Dascalu,Trausan-Matu and all, 2010)
Collaboration regions detection (Banica, Trausan-Matu and Rebedea,2011)
Detection of the Important moments (Chiru and Trausan-Matu, 2012)
Intertextuality detection (Ghiban and Trausan-Matu, 2012)
ReaderBench (Dascălu, Trăușan-Matu and Dessus, 2013)14Republic of Letters, University of Oxford23 March 2015
![Page 15: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/15.jpg)
Intertextuality analysis
• Mikhail Bakhtin’s dialogistical and polyphonicmodel Intertextuality (Kristeva)
• Analyze how concepts are echoed from onetext to another (intertextual networks)
• To indicate membership to a philosophicaltrend or influences among authors
15Republic of Letters, University of Oxford23 March 2015
![Page 16: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/16.jpg)
Bakhtin’s Polyphonic Intertextuality
Theme 2 and Theme 3 mayhave the same words butonly different concepts
Section 1 and 6 are dialogical or polyphonical.They may present a higher force ofexpresivity.
16Republic of Letters, University of Oxford23 March 2015
![Page 17: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/17.jpg)
PolyCAFe(Trăușan-Matu, Dascălu and Rebedea)
• Polyphony-based Collaboration Analysis andFeedback generation
• Developed in the “Language Technologies forLifelong Learning” EU FP7 project(http://www.ltfll-project.org/)
• Analyses chat (instant messenger) logs withmore than two participants using thepolyphonic model (Trăușan-Matu)
17Republic of Letters, University of Oxford23 March 2015
![Page 18: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/18.jpg)
From: Trăuşan-Matu , A Polyphonic Model for Interethnic Discourse, 2013
18Republic of Letters, University of Oxford23 March 2015
![Page 19: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/19.jpg)
PolyCAFe
From: Trăuşan-Matu , A Polyphonic Model for Interethnic Discourse, 2013
19Republic of Letters, University of Oxford23 March 2015
![Page 20: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/20.jpg)
ReaderBench(Dascalu, Trăușan-Matu and Dessus)
• Based on
– LSA, LDA
– Polyphonic model
– WordNet
– Social Network Analysis
23 March 2015 Republic of Letters, University of Oxford 20
(http://wordnet.princeton.edu)
![Page 21: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/21.jpg)
NLP Text pre-processing in PolyCAFe and ReaderBench
23 March 2015 Republic of Letters, University of Oxford
![Page 22: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/22.jpg)
ReaderBench Document view
23 March 2015 Republic of Letters, University of Oxford 22
![Page 23: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/23.jpg)
ReaderBench Concept View
23 March 2015 Republic of Letters, University of Oxford 23
![Page 24: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/24.jpg)
Concept View
24
![Page 25: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/25.jpg)
ReaderBench Corpus Similarity
23 March 2015 Republic of Letters, University of Oxford 25
![Page 26: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/26.jpg)
ReaderBench Document Centrality
26
![Page 27: Trausan-Matu: Natural Language Processing and Topic Modelling](https://reader036.fdocuments.us/reader036/viewer/2022081517/586690cf1a28ab78408b7ba1/html5/thumbnails/27.jpg)
Thank you!
Questions?
http://www.racai.ro/trausan
27Republic of Letters, University of Oxford23 March 2015