Word counts with bag-of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing...
DataCamp Natural Language Processing Fundamentals in Python
Word counts with bag-of-words
NATURAL LANGUAGE PROCESSING FUNDAMENTALS IN PYTHON
Katharine Jarmul, Founder, kjamistan
Bag-of-words
- Basic method for finding topics in a text
- Need to first create tokens using tokenization...
- ...and then count up all the tokens
- The more frequent a word, the more important it might be
- Can be a great way to determine the significant words in a text
Bag-of-words example
Text: "The cat is in the box. The cat likes the box. The box is over the cat."
Bag of words (stripped punctuation):
"The": 3, "box": 3, "cat": 3, "the": 3, "is": 2, "in": 1, "likes": 1, "over": 1
Bag-of-words in Python
In [1]: from nltk.tokenize import word_tokenize
In [2]: from collections import Counter
In [3]: counter = Counter(word_tokenize(
   ...:     """The cat is in the box. The cat likes the box. The box is over the cat."""))
   ...: counter
Out[3]: Counter({'.': 3, 'The': 3, 'box': 3, 'cat': 3, 'in': 1, ... 'the': 3})
In [4]: counter.most_common(2)  # order of tied counts may vary by Python version
Out[4]: [('The', 3), ('box', 3)]
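
The same counting can be sketched with the standard library alone. This is a minimal illustration, not the course's method: it assumes a simple regular-expression tokenizer (any run of ASCII letters is a token) as a stand-in for `word_tokenize`, which also drops the punctuation. It additionally shows how lowercasing before counting merges "The" and "the".

```python
import re
from collections import Counter

text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Assumption: a token is any run of ASCII letters, so punctuation is dropped.
tokens = re.findall(r"[A-Za-z]+", text)
bag = Counter(tokens)

# Lowercasing before counting merges "The" and "the" into a single entry.
bag_lower = Counter(token.lower() for token in tokens)

print(bag["The"], bag["the"], bag_lower["the"])  # prints: 3 3 6
```

Without lowercasing, the most frequent "word" is split across two entries, which is one reason the preprocessing steps in the next section matter.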
Let's practice!

Simple text preprocessing
Why preprocess?
- Helps make for better input data
  - When performing machine learning or other statistical methods
- Examples:
  - Tokenization to create a bag of words
  - Lowercasing words
- Lemmatization / stemming
  - Shorten words to their root stems
- Removing stop words, punctuation, or unwanted tokens
- Good to experiment with different approaches
Preprocessing example
Input text: Cats, dogs and birds are common pets. So are fish.
Output tokens: cat, dog, bird, common, pet, fish
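
One way to get from that input to that output, sketched with the standard library only. Everything here is an assumption made for illustration: the stop-word set `STOP_WORDS` is a tiny hand-picked list, and `naive_singular` is a hypothetical helper that strips a trailing "s" as a crude stand-in for real lemmatization or stemming.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real project would use a
# fuller list, such as the English list that ships with NLTK.
STOP_WORDS = {"and", "are", "so", "the", "is", "in"}


def naive_singular(word):
    """Strip a trailing 's' as a crude stand-in for lemmatization."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    # Lowercase, keep alphabetic runs only (drops punctuation), then
    # remove stop words and reduce simple plurals.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_singular(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("Cats, dogs and birds are common pets. So are fish.")
print(tokens)  # ['cat', 'dog', 'bird', 'common', 'pet', 'fish']
```

The plural rule is far too simple for general text (it would mangle words such as "this"), which is why a proper lemmatizer or stemmer is normally used for this step.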
Text preprocessing with Python
In [1]: from collections import Counter
   ...: from nltk.corpus import stopwords
   ...: from nltk.tokenize import word_tokenize
In [2]: text = """The cat is in the box. The cat likes the box. The box is over the cat."""
In [3]: tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
In [4]: no_stops = [t for t in tokens if t not in stopwords.words('english')]
In [5]: Counter(no_stops).most_common(2)
Out[5]: [('cat', 3), ('box', 3)]
Let's practice!

Introduction to gensim
What is gensim?
- Popular open-source NLP library
- Uses top academic models to perform complex tasks
  - Building document or word vectors
  - Performing topic identification and document comparison
What is a word vector?
Gensim example
(Source: http://tlfvincent.github.io/2015/10/23/presidential-speech-topics)
Creating a gensim dictionary
In [1]: from gensim.corpora.dictionary import Dictionary
In [2]: from nltk.tokenize import word_tokenize
In [3]: my_documents = ['The movie was about a spaceship and aliens.',
   ...:                 'I really liked the movie!',
   ...:                 'Awesome action scenes, but boring characters.',
   ...:                 'The movie was awful! I hate alien films.',
   ...:                 'Space is cool! I liked the movie.',
   ...:                 'More space films, please!',]
In [4]: tokenized_docs = [word_tokenize(doc.lower())
   ...:                   for doc in my_documents]
In [5]: dictionary = Dictionary(tokenized_docs)
In [6]: dictionary.token2id
Out[6]: {'!': 11, ',': 17, '.': 7, 'a': 2, 'about': 4, ...}
Creating a gensim corpus
- gensim models can be easily saved, updated, and reused
- Our dictionary can also be updated
- This more advanced and feature-rich bag-of-words can be used in future exercises
In [7]: corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
In [8]: corpus
Out[8]: [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)], [(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)], ...]
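
To make the structure of that corpus concrete, here is a minimal pure-Python sketch of the two steps: mapping each token to an integer id, then turning each document into a list of (token id, count) pairs. It is not gensim's implementation. The helpers `tokenize`, `build_token2id`, and `doc_to_bow` are hypothetical names, the regex tokenizer is an assumed stand-in for `word_tokenize`, and ids are assigned in first-seen order, so they will not necessarily match the ids gensim assigns.

```python
import re
from collections import Counter

my_documents = [
    'The movie was about a spaceship and aliens.',
    'I really liked the movie!',
    'Awesome action scenes, but boring characters.',
    'The movie was awful! I hate alien films.',
    'Space is cool! I liked the movie.',
    'More space films, please!',
]


def tokenize(text):
    # Assumption: words and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_token2id(tokenized_docs):
    """Assign each distinct token the next free integer id (first-seen order)."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


tokenized_docs = [tokenize(doc) for doc in my_documents]
token2id = build_token2id(tokenized_docs)
corpus = [doc_to_bow(doc, token2id) for doc in tokenized_docs]

print(corpus[1])  # [(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)]
```

The second document reuses ids 0 and 1 ("the" and "movie", already seen in the first document) and introduces four new ids, which is why its list starts low and then jumps to 9.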
Let's practice!

Tf-idf with gensim
What is tf-idf?
- Term frequency - inverse document frequency
- Allows you to determine the most important words in each document
- Each corpus may have shared words beyond just stop words
- These words should be down-weighted in importance
- Example from astronomy: "Sky"
- Ensures most common words don't show up as keywords
- Keeps document-specific frequent words weighted high
Tf-idf formula
$$w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)$$
- $w_{i,j}$ = tf-idf weight for token $i$ in document $j$
- $tf_{i,j}$ = number of occurrences of token $i$ in document $j$
- $df_i$ = number of documents that contain token $i$
- $N$ = total number of documents
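
A direct transcription of the formula into Python may help show the down-weighting effect. This is a sketch under assumptions: the slide does not state the base of the logarithm, so the natural log is used here, and the function name `tfidf_weight` and the counts below are made up for illustration.

```python
import math


def tfidf_weight(tf, df, n_docs):
    """w_ij = tf_ij * log(N / df_i), using the natural logarithm (assumed)."""
    return tf * math.log(n_docs / df)


# Hypothetical counts in a corpus of 6 documents; both tokens occur
# 3 times in the document being scored.
common = tfidf_weight(tf=3, df=6, n_docs=6)  # token appears in every document
rare = tfidf_weight(tf=3, df=1, n_docs=6)    # token appears in one document only

print(common, rare)
```

A token found in every document gets weight 0 because log(N/N) = 0, while an equally frequent token found in a single document keeps a large weight.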
Tf-idf with gensim
In [10]: from gensim.models.tfidfmodel import TfidfModel
In [11]: tfidf = TfidfModel(corpus)
In [12]: tfidf[corpus[1]]
Out[12]: [(0, 0.1746298276735174), (1, 0.1746298276735174), (9, 0.29853166221463673), (10, 0.7716931521027908), ...]
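
The numbers in `Out[12]` are consistent with a base-2 logarithm followed by scaling each document's vector to unit length, which are believed to be the `TfidfModel` defaults. The following is a rough pure-Python check under those assumptions, not gensim's own code: it uses a regex tokenizer as a stand-in for `word_tokenize`, keys the weights by token string rather than by id, and `tfidf_vector` is a hypothetical helper name.

```python
import math
import re
from collections import Counter

my_documents = [
    'The movie was about a spaceship and aliens.',
    'I really liked the movie!',
    'Awesome action scenes, but boring characters.',
    'The movie was awful! I hate alien films.',
    'Space is cool! I liked the movie.',
    'More space films, please!',
]


def tokenize(text):
    # Assumption: words and single punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text.lower())


tokenized_docs = [tokenize(doc) for doc in my_documents]
n_docs = len(tokenized_docs)

# Document frequency: the number of documents in which each token appears.
doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))


def tfidf_vector(doc):
    """tf * log2(N / df) per token, then scaled to unit (L2) length."""
    tf = Counter(doc)
    raw = {t: tf[t] * math.log2(n_docs / doc_freq[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


weights = tfidf_vector(tokenized_docs[1])  # 'I really liked the movie!'
for token, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{token:>7}  {weight:.4f}")
```

Under these assumptions "the" and "movie" (present in four of the six documents) come out near 0.1746, while "really" (present only in this document) comes out near 0.7717, in line with the values shown above.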
Let's practice!