An Introduction to gensim: "Topic Modelling for Humans"
[Page 1]
gensim: topic modeling for humans
William Bert, DC Python Meetup, 1 May 2012
[Page 2]
please go to http://ADDRESS and enter a sentence
interesting relationships?
gensim generated the data for those visualizations by computing the semantic similarity of the input
[Page 3]
who am I?
William Bert
developer at Carney Labs (teamcarney.com)
user of gensim
still new to the world of topic modelling, semantic similarity, etc.
[Page 4]
gensim: “topic modeling for humans”
topic modeling attempts to uncover the underlying semantic structure of a set of data by identifying recurring patterns of terms (topics).
topic modelling does not parse sentences, does not care about word order, and does not "understand" grammar or syntax.
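To make the "does not care about word order" point concrete, here is a minimal sketch in plain Python (an illustration of the bag-of-words idea, not gensim's API): two sentences with very different meanings reduce to identical term counts.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase, split on whitespace, and count term occurrences."""
    return Counter(sentence.lower().split())

# Different word order (and different meaning!), same bag of words:
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
print(a == b)  # True: both reduce to {'the': 2, 'dog': 1, 'bit': 1, 'man': 1}
```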
[Page 5]
gensim: "topic modeling for humans"

>>> lsi_model.show_topics()
['-0.203*"smith" + 0.166*"jan" + 0.132*"soccer" + 0.132*"software" + 0.119*"fort" + -0.119*"nov" + 0.116*"miss" + -0.114*"opera" + -0.112*"oct" + -0.105*"water"',
 '0.179*"squadron" + 0.158*"smith" + -0.140*"creek" + 0.135*"chess" + -0.130*"air" + 0.128*"en" + -0.122*"nov" + -0.120*"fr" + 0.119*"jan" + -0.115*"wales"',
 '0.373*"jan" + -0.236*"chess" + -0.234*"nov" + -0.208*"oct" + 0.151*"dec" + -0.106*"pennsylvania" + 0.096*"view" + -0.092*"fort" + -0.091*"feb" + -0.090*"engineering"',
 ...]
[Page 6]
gensim isn't about topic modeling (for me, anyway). It's about similarity.

What is similarity? Some types:
• String matching
• Stylometry
• Term frequency
• Semantic (meaning)
[Page 7]
Is "A seven-year quest to collect samples from the solar system's formation ended in triumph in a dark and wet Utah desert this weekend."
similar in meaning to
"For a month, a huge storm with massive lightning has been raging on Jupiter under the watchful eye of an orbiting spacecraft."
more or less than it is similar to
"One of Saturn's moons is spewing a giant plume of water vapour that is feeding the planet's rings, scientists say."?
[Page 8]
Who cares about semantic similarity?
Some use cases:
• Query large collections of text
• Automatic metadata
• Recommendations
• Better human-computer interaction
[Page 9]
gensim.corpora
corpus = stream of vectors of document feature ids
for example, words in documents are features ("bag of words")

TextCorpus and other kinds of corpus classes:
>>> corpus = TextCorpus(file_like_object)
>>> [doc for doc in corpus]
[[(40, 1), (6, 1), (78, 2)], [(39, 1), (58, 1), ...]
[Page 10]
Dictionary class
dictionary maps features (words) to feature ids (numbers)
>>> print corpus.dictionary
Dictionary(8472 unique tokens)
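What the dictionary does can be illustrated with a hand-rolled sketch (plain Python, not gensim's implementation — the real Dictionary has options such as whether unseen tokens may be added): each new token gets the next integer id, and a document becomes a list of (feature id, count) pairs.

```python
class ToyDictionary:
    """Minimal stand-in for a token-to-id dictionary."""
    def __init__(self):
        self.token2id = {}

    def doc2bow(self, tokens):
        """Return sorted (feature_id, count) pairs, assigning ids to new tokens."""
        counts = {}
        for tok in tokens:
            if tok not in self.token2id:
                self.token2id[tok] = len(self.token2id)
            fid = self.token2id[tok]
            counts[fid] = counts.get(fid, 0) + 1
        return sorted(counts.items())

d = ToyDictionary()
bow = d.doc2bow("human computer interaction human".split())
print(bow)  # [(0, 2), (1, 1), (2, 1)]
```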
gensim.corpora
[Page 11]
need a massive collection of documents that ostensibly has meaning
sounds like a job for Wikipedia

>>> wiki_corpus = WikiCorpus(articles)  # articles is a Wikipedia text dump bz2 file. several hours.
>>> wiki_corpus.dictionary.save("wiki_dict.dict")  # persist dictionary
>>> MmCorpus.serialize("wiki_corpus.mm", wiki_corpus)  # uses numpy to persist corpus in Matrix Market format. several GBs. can be bz2'ed.
>>> wiki_corpus = MmCorpus("wiki_corpus.mm")  # revive a corpus
[Page 12]
gensim.models
transform corpora using model classes
for example, the term frequency/inverse document frequency (TFIDF) transformation
reflects the importance of a term, not just its presence/absence
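The idea behind TFIDF can be sketched in a few lines of plain Python (a simplified scheme for illustration; gensim's TfidfModel uses its own log-based weighting and normalization): a term's weight grows with its count in a document, discounted by how many documents contain it.

```python
import math

# toy corpus: each document is a list of tokens
docs = [["cat", "sat", "mat"],
        ["cat", "cat", "hat"],
        ["dog", "sat", "log"]]

def idf(term):
    """Inverse document frequency: rare terms score higher than common ones."""
    df = sum(1 for d in docs if term in d)  # documents containing the term
    return math.log(len(docs) / df, 2)

def tfidf(term, doc):
    """Raw term count weighted by inverse document frequency."""
    return doc.count(term) * idf(term)

# "cat" appears in 2 of 3 docs, "hat" in only 1, so "hat" outweighs "cat"
# in docs[1] even though "cat" occurs there twice.
print(tfidf("hat", docs[1]) > tfidf("cat", docs[1]))  # True
```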
[Page 13]
>>> tfidf_trans = models.TfidfModel(wiki_corpus, id2word=dictionary)  # TFIDF computes frequencies of all document features in the corpus. several hours.
TfidfModel(num_docs=3430645, num_nnz=547534266)
gensim.models
>>> tfidf_trans[documents]  # emits documents in TFIDF representation. documents must be in the same BOW vector space as wiki_corpus.
[[(40, 0.23), (6, 0.12), (78, 0.65)], [(39, ...]
>>> MmCorpus.serialize("wiki_tfidf.mm", tfidf_trans[wiki_corpus])  # builds new corpus by iterating over documents transformed to TFIDF
>>> tfidf_corpus = MmCorpus("wiki_tfidf.mm")
[Page 14]
>>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_topics=400)  # creates LSI transformation model from TFIDF corpus representation
gensim.models
[Page 15]
topics again for a bit

>>> lsi_model.show_topics()
['-0.203*"smith" + 0.166*"jan" + 0.132*"soccer" + 0.132*"software" + 0.119*"fort" + -0.119*"nov" + 0.116*"miss" + -0.114*"opera" + -0.112*"oct" + -0.105*"water"',
 '0.179*"squadron" + 0.158*"smith" + -0.140*"creek" + 0.135*"chess" + -0.130*"air" + 0.128*"en" + -0.122*"nov" + -0.120*"fr" + 0.119*"jan" + -0.115*"wales"',
 '0.373*"jan" + -0.236*"chess" + -0.234*"nov" + -0.208*"oct" + 0.151*"dec" + -0.106*"pennsylvania" + 0.096*"view" + -0.092*"fort" + -0.091*"feb" + -0.090*"engineering"',
 ...]
[Page 16]
topics again for a bit
• SVD decomposes a matrix into three simpler matrices
• full-rank SVD would be able to recreate the underlying matrix exactly from those three matrices
• lower-rank SVD provides the best (least square error) approximation of the matrix
• this approximation can find interesting relationships among data
• it preserves most information while reducing noise and merging dimensions associated with terms that have similar meanings
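The bullet points above can be checked directly with numpy (generic SVD on a toy matrix, not gensim code; the term-document matrix here is made up for illustration):

```python
import numpy as np

# toy term-document matrix: rows = terms, columns = documents
A = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 0., 1., 1.],
              [0., 1., 0., 1.]])

# SVD decomposes A into three matrices: U, a diagonal of singular
# values s, and Vt. Keeping all of them recreates A exactly.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
exact = U @ np.diag(s) @ Vt
print(np.allclose(A, exact))  # True

# Truncating to the k largest singular values gives the best (least
# squares) rank-k approximation of A — fewer dimensions, most information.
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```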
[Page 17]
topics again for a bit
• SVD: alias-i.com/lingpipe/demos/tutorial/svd/read-me.html
• Original paper: www.cob.unt.edu/itds/faculty/evangelopoulos/dsci5910/LSA_Deerwester1990.pdf
• General explanation: tottdp.googlecode.com/files/LandauerFoltz-Laham1998.pdf
• Many more
[Page 18]
>>> lsi_trans = models.LsiModel(corpus=tfidf_corpus, id2word=dictionary, num_topics=400, decay=1.0, chunksize=20000)  # creates LSI transformation model from TFIDF corpus representation
>>> print lsi_trans
LsiModel(num_terms=100000, num_topics=400, decay=1.0, chunksize=20000)
gensim.models
[Page 19]
gensim.similarities (the best part)

>>> index = Similarity(corpus=lsi_trans[tfidf_trans[index_corpus]], num_features=400, output_prefix="/tmp/shard")
>>> index[lsi_trans[tfidf_trans[dictionary.doc2bow(tokenize(query))]]]  # similarity of each document in the index corpus to a new query document
>>> [s for s in index]  # a matrix of each document's similarities to all other documents
[array([ 1.  ,  0.  ,  0.08,  0.01]), array([ 0.  ,  1.  ,  0.02, -0.02]), array([ 0.08,  0.02,  1.  ,  0.15]), array([ 0.01, -0.02,  0.15,  1.  ])]
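Under the hood, a similarity index like this boils down to pairwise cosine similarity over the transformed document vectors. A numpy sketch (illustrative vectors, not the actual data behind the output above) reproduces the shape of that matrix: symmetric, with 1.0 on the diagonal because every document is maximally similar to itself.

```python
import numpy as np

# four documents as (already transformed) feature vectors
docs = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.7, 0.3, 0.1],
                 [0.1, 0.2, 0.9]])

# normalize rows to unit length so dot products are cosine similarities
unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)

sims = unit @ unit.T  # pairwise similarity matrix
print(np.allclose(np.diag(sims), 1.0))  # True
```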
[Page 20]
about gensim
created by Radim Rehurek
• radimrehurek.com/gensim
• github.com/piskvorky/gensim
• groups.google.com/group/gensim
four additional models available

dependencies: numpy, scipy
optional: Pyro, Pattern
[Page 21]
example code, visualization code, and ppt: github.com/sandinmyjoints
interview with Radim: williamjohnbert.com
thank you
[Page 22]
(additional slides)
[Page 23]
• term frequency/inverse document frequency (TFIDF)
• log entropy
• random projections
• latent Dirichlet allocation (LDA)
• hierarchical Dirichlet process (HDP)
• latent semantic analysis/indexing (LSA/LSI)
gensim.models
[Page 24]
Dependencies: numpy and scipy, and optionally Pyro for distributed and Pattern for lemmatization
data from Lee 2005 and other papers is available in gensim for tests
slightly more about gensim
[Page 25]
>>> lda_model.show_topics()
['0.083*bridge + 0.034*dam + 0.034*river + 0.027*canal + 0.026*construction + 0.014*ferry + 0.013*bridges + 0.013*tunnel + 0.012*trail + 0.012*reservoir',
 '0.044*fight + 0.029*bout + 0.029*via + 0.028*martial + 0.025*boxing + 0.024*submission + 0.021*loss + 0.021*mixed + 0.020*arts + 0.020*fighting',
 '0.086*italian + 0.062*italy + 0.048*di + 0.024*milan + 0.019*rome + 0.014*venice + 0.013*giovanni + 0.012*della + 0.011*florence + 0.011*francesco']
gensim: “topic modelling for humans”