Word counts with bag-of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing...

Post on 05-Apr-2020


DataCamp Natural Language Processing Fundamentals in Python

Word counts with bag-of-words

NATURAL LANGUAGE PROCESSING FUNDAMENTALS IN PYTHON

Katharine Jarmul
Founder, kjamistan


Bag-of-words

- Basic method for finding topics in a text
- Need to first create tokens using tokenization
- ... and then count up all the tokens
- The more frequent a word, the more important it might be
- Can be a great way to determine the significant words in a text


Bag-of-words example

Text: "The cat is in the box. The cat likes the box. The box is over the cat."

Bag of words (stripped punctuation):

"The": 3, "box": 3, "cat": 3, "the": 3, "is": 2, "in": 1, "likes": 1, "over": 1


Bag-of-words in Python

In [1]: from nltk.tokenize import word_tokenize

In [2]: from collections import Counter

In [3]: counter = Counter(word_tokenize(
   ...:     """The cat is in the box. The cat likes the box. The box is over the cat."""))

In [4]: counter
Out[4]: Counter({'.': 3, 'The': 3, 'box': 3, 'cat': 3, 'in': 1, ... 'the': 3})

In [5]: counter.most_common(2)  # order among tied counts can vary by Python version
Out[5]: [('The', 3), ('box', 3)]
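For comparison, the stripped-punctuation counts from the example slide can be reproduced with the standard library alone. This is a minimal sketch rather than the course's code: it assumes a simple regular expression (runs of ASCII letters) as a stand-in tokenizer, which drops punctuation but is not equivalent to word_tokenize.

```python
import re
from collections import Counter

text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Assumption: a run of letters counts as one token; punctuation is dropped.
tokens = re.findall(r"[A-Za-z]+", text)

# Counter maps each distinct token to its number of occurrences.
bag = Counter(tokens)

print(bag["The"], bag["the"], bag["cat"], bag["box"])  # 3 3 3 3
print(bag["is"], bag["in"], bag["likes"], bag["over"])  # 2 1 1 1
```

Because no lowercasing is applied, "The" and "the" are still counted as two different tokens.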


Let's practice!


Simple text preprocessing


Why preprocess?

- Helps make for better input data
  - When performing machine learning or other statistical methods
- Examples:
  - Tokenization to create a bag of words
  - Lowercasing words
- Lemmatization / Stemming
  - Shorten words to their root stems
- Removing stop words, punctuation, or unwanted tokens
- Good to experiment with different approaches


Preprocessing example

Input text: Cats, dogs and birds are common pets. So are fish.

Output tokens: cat, dog, bird, common, pet, fish
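One way to arrive at those output tokens, sketched with the standard library only. The stop-word set and the plural-stripping rule below are hypothetical stand-ins chosen for this one sentence; a real pipeline would more likely rely on a stop-word list and a lemmatizer from an NLP library.

```python
import re

text = "Cats, dogs and birds are common pets. So are fish."

# Hypothetical, deliberately tiny stop-word list for this example only.
STOP_WORDS = {"and", "are", "so"}

def naive_singular(word):
    """Crude stand-in for lemmatization: drop one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

# 1. Lowercase and keep alphabetic tokens only (this drops punctuation).
tokens = re.findall(r"[a-z]+", text.lower())
# 2. Remove stop words.
content = [t for t in tokens if t not in STOP_WORDS]
# 3. Reduce plural forms to an approximate root.
output = [naive_singular(t) for t in content]

print(output)  # ['cat', 'dog', 'bird', 'common', 'pet', 'fish']
```

The trailing-"s" rule breaks on many English words, which is one reason to experiment with different approaches.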


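Text preprocessing with Python

In [1]: from collections import Counter
   ...: from nltk.corpus import stopwords
   ...: from nltk.tokenize import word_tokenize

In [2]: text = """The cat is in the box. The cat likes the box. The box is over the cat."""

In [3]: # lowercase, tokenize, and keep alphabetic tokens only
   ...: tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]

In [4]: # drop English stop words
   ...: no_stops = [t for t in tokens if t not in stopwords.words('english')]

In [5]: Counter(no_stops).most_common(2)
Out[5]: [('cat', 3), ('box', 3)]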


Let's practice!


Introduction to gensim


What is gensim?

- Popular open-source NLP library
- Uses top academic models to perform complex tasks
  - Building document or word vectors
  - Performing topic identification and document comparison


What is a word vector?


Gensim Example

(Source: http://tlfvincent.github.io/2015/10/23/presidential-speech-topics)


Creating a gensim dictionary

In [1]: from gensim.corpora.dictionary import Dictionary

In [2]: from nltk.tokenize import word_tokenize

In [3]: my_documents = ['The movie was about a spaceship and aliens.',
   ...:                 'I really liked the movie!',
   ...:                 'Awesome action scenes, but boring characters.',
   ...:                 'The movie was awful! I hate alien films.',
   ...:                 'Space is cool! I liked the movie.',
   ...:                 'More space films, please!',]

In [4]: tokenized_docs = [word_tokenize(doc.lower())
   ...:                   for doc in my_documents]

In [5]: dictionary = Dictionary(tokenized_docs)

In [6]: dictionary.token2id
Out[6]: {'!': 11, ',': 17, '.': 7, 'a': 2, 'about': 4, ...}
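Conceptually, token2id is a mapping from each distinct token to an integer id. The sketch below builds such a mapping by hand to show the idea. It is not gensim's implementation: the regex tokenizer is a stand-in for word_tokenize, build_token2id is a hypothetical helper, and the ids it assigns (first-seen order) will generally not match the ones printed in Out[6].

```python
import re

my_documents = [
    "The movie was about a spaceship and aliens.",
    "I really liked the movie!",
]

def tokenize(doc):
    """Stand-in tokenizer: lowercased words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", doc.lower())

def build_token2id(tokenized_docs):
    """Assign ids 0, 1, 2, ... to tokens in the order they are first seen."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

tokenized_docs = [tokenize(doc) for doc in my_documents]
token2id = build_token2id(tokenized_docs)

print(token2id["the"], token2id["movie"], token2id["!"])  # 0 1 12
```

Tokens shared between documents, such as "the" and "movie", keep the id they received the first time they were seen.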


Creating a gensim corpus

- gensim models can be easily saved, updated, and reused
- Our dictionary can also be updated
- This more advanced and feature-rich bag-of-words can be used in future exercises

In [7]: corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

In [8]: corpus
Out[8]: [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
         [(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
         ...]
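Each document in the corpus is a list of (token id, count) pairs. A minimal hand-rolled sketch of that conversion follows; it is not gensim's code, doc_to_bow is a hypothetical helper, and the token2id mapping is written out by hand with illustrative ids.

```python
from collections import Counter

# Hypothetical mapping for illustration; gensim's Dictionary would build this.
token2id = {"the": 0, "movie": 1, "was": 2, "awful": 3, "!": 4}

def doc_to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens without an id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = doc_to_bow(["the", "movie", "was", "awful", "!", "the", "end"], token2id)
print(bow)  # [(0, 2), (1, 1), (2, 1), (3, 1), (4, 1)]
```

Here "the" occurs twice, so its pair is (0, 2), and "end" is silently dropped because it has no id in the mapping.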


Let's practice!


Tf-idf with gensim


What is tf-idf?

- Term frequency - inverse document frequency
- Allows you to determine the most important words in each document
- Each corpus may have shared words beyond just stop words
- These words should be down-weighted in importance
- Example from astronomy: "Sky"
- Ensures most common words don't show up as keywords
- Keeps document-specific frequent words weighted high


Tf-idf formula

$$w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)$$

$w_{i,j}$ = tf-idf weight for token $i$ in document $j$

$tf_{i,j}$ = number of occurrences of token $i$ in document $j$

$df_i$ = number of documents that contain token $i$

$N$ = total number of documents
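A small worked example of the formula may help. The slide does not state the base of the logarithm, so the natural logarithm is assumed here; tfidf_weight is a hypothetical helper name, and the counts are made up for illustration.

```python
import math

def tfidf_weight(tf, df, n_docs):
    """w = tf * log(N / df), using the natural logarithm (an assumption)."""
    return tf * math.log(n_docs / df)

# A token that occurs twice in a document and appears in 1 of 4 documents:
print(round(tfidf_weight(tf=2, df=1, n_docs=4), 4))  # 2.7726

# A token present in every document gets weight 0, however frequent it is:
print(tfidf_weight(tf=5, df=4, n_docs=4))  # 0.0
```

The second call shows the down-weighting at work: when df equals N, the logarithm is zero and the token contributes nothing.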


Tf-idf with gensim

In [10]: from gensim.models.tfidfmodel import TfidfModel

In [11]: tfidf = TfidfModel(corpus)

In [12]: tfidf[corpus[1]]
Out[12]: [(0, 0.1746298276735174),
          (1, 0.1746298276735174),
          (9, 0.29853166221463673),
          (10, 0.7716931521027908),
          ...]
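The values in Out[12] are consistent with computing tf * log2(N / df) for each token and then scaling the document's weights to unit length, which to my understanding is gensim's default behaviour. The sketch below checks that reading by hand for the second document. It is not gensim's code: the regex tokenizer is a stand-in for word_tokenize, and weights are keyed by token rather than by id.

```python
import math
import re
from collections import Counter

my_documents = [
    "The movie was about a spaceship and aliens.",
    "I really liked the movie!",
    "Awesome action scenes, but boring characters.",
    "The movie was awful! I hate alien films.",
    "Space is cool! I liked the movie.",
    "More space films, please!",
]

def tokenize(doc):
    """Stand-in tokenizer: lowercased words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", doc.lower())

docs = [tokenize(d) for d in my_documents]
n_docs = len(docs)

# Document frequency: in how many documents each token appears.
df = Counter(token for doc in docs for token in set(doc))

def tfidf(doc):
    """tf * log2(N / df) per token, scaled so the vector has unit L2 norm."""
    tf = Counter(doc)
    raw = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[1])
for token in ("the", "i", "really"):
    print(token, round(weights[token], 4))
```

"really" appears in only one document and so receives the largest weight, while "the" appears in four of the six documents and is weighted lowest.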


Let's practice!
