Word counts with bag-of-words - Amazon S3 · 2017-08-14 · DataCamp Natural Language Processing...

DataCamp Natural Language Processing Fundamentals in Python

Word counts with bag-of-words

NATURAL LANGUAGE PROCESSING FUNDAMENTALS IN PYTHON

Katharine Jarmul, Founder, kjamistan

Bag-of-words
- Basic method for finding topics in a text
- Need to first create tokens using tokenization
- ... and then count up all the tokens
- The more frequent a word, the more important it might be
- Can be a great way to determine the significant words in a text

Bag-of-words example

Text: "The cat is in the box. The cat likes the box. The box is over the cat."

Bag of words (stripped punctuation):

"The": 3, "box": 3, "cat": 3, "the": 3, "is": 2, "in": 1, "likes": 1, "over": 1

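Bag-of-words in Python

In [1]: from nltk.tokenize import word_tokenize

In [2]: from collections import Counter

In [3]: counter = Counter(word_tokenize("""The cat is in the box. The cat likes the box. The box is over the cat."""))

In [4]: counter
Out[4]: Counter({'.': 3, 'The': 3, 'box': 3, 'cat': 3, 'in': 1, ... 'the': 3})

In [5]: # Several tokens tie at a count of 3; which two are listed can vary by Python version.
   ...: counter.most_common(2)
Out[5]: [('The', 3), ('box', 3)]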
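The "stripped punctuation" counts from the bag-of-words example can be reproduced with the standard library alone. This is a minimal sketch that assumes a simple regular expression is an acceptable stand-in for `word_tokenize`; the regex and the variable names are illustrative and not part of the slides.

```python
import re
from collections import Counter

text = "The cat is in the box. The cat likes the box. The box is over the cat."

# Assumption: keep only runs of letters, which drops the periods.
# Real tokenizers such as word_tokenize handle many more cases.
tokens = re.findall(r"[A-Za-z]+", text)

# Counter maps each distinct token to the number of times it occurs.
bag = Counter(tokens)

for word in ("The", "the", "box", "cat", "is", "in", "likes", "over"):
    print(word, bag[word])
```

Because no lowercasing is applied here, "The" and "the" are counted as separate tokens, just as in the example above.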

Let's practice!

Simple text preprocessing

Why preprocess?
- Helps make for better input data
  - When performing machine learning or other statistical methods
- Examples:
  - Tokenization to create a bag of words
  - Lowercasing words
- Lemmatization/Stemming
  - Shorten words to their root stems
- Removing stop words, punctuation, or unwanted tokens
- Good to experiment with different approaches

Preprocessing example

Input text: Cats, dogs and birds are common pets. So are fish.

Output tokens: cat, dog, bird, common, pet, fish
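One way to get from that input text to those output tokens is sketched below using only the standard library. The stop word set and the `naive_singular` helper are hypothetical, deliberately tiny stand-ins for a real stop word list and a real lemmatizer; they are only good enough for this one sentence.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real list is much longer.
STOP_WORDS = {"and", "are", "so"}

def naive_singular(word):
    """Toy stand-in for lemmatization: drop one trailing 's' from longer words."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def preprocess(text):
    # Lowercase, keep alphabetic runs only (drops punctuation),
    # remove stop words, then reduce plurals.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_singular(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("Cats, dogs and birds are common pets. So are fish.")
print(tokens)
```

The trailing-"s" rule breaks on many ordinary words (for example "bus"), which is one reason a proper lemmatizer or stemmer is preferable in practice.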

Text preprocessing with Python

In [1]: from collections import Counter

In [2]: from nltk.corpus import stopwords

In [3]: from nltk.tokenize import word_tokenize

In [4]: text = """The cat is in the box. The cat likes the box. The box is over the cat."""

In [5]: tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]

In [6]: no_stops = [t for t in tokens if t not in stopwords.words('english')]

In [7]: Counter(no_stops).most_common(2)
Out[7]: [('cat', 3), ('box', 3)]

Let's practice!

Introduction to gensim

What is gensim?
- Popular open-source NLP library
- Uses top academic models to perform complex tasks
  - Building document or word vectors
  - Performing topic identification and document comparison

What is a word vector?

Gensim Example

(Source: http://tlfvincent.github.io/2015/10/23/presidential-speech-topics)

Creating a gensim dictionary

In [1]: from gensim.corpora.dictionary import Dictionary

In [2]: from nltk.tokenize import word_tokenize

In [3]: my_documents = ['The movie was about a spaceship and aliens.',
   ...:                 'I really liked the movie!',
   ...:                 'Awesome action scenes, but boring characters.',
   ...:                 'The movie was awful! I hate alien films.',
   ...:                 'Space is cool! I liked the movie.',
   ...:                 'More space films, please!',]

In [4]: tokenized_docs = [word_tokenize(doc.lower())
   ...:                   for doc in my_documents]

In [5]: dictionary = Dictionary(tokenized_docs)

In [6]: dictionary.token2id
Out[6]: {'!': 11, ',': 17, '.': 7, 'a': 2, 'about': 4, ...}

Creating a gensim corpus

- gensim models can be easily saved, updated, and reused
- Our dictionary can also be updated
- This more advanced and feature-rich bag-of-words can be used in future exercises

In [7]: corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

In [8]: corpus
Out[8]: [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
         [(0, 1), (1, 1), (9, 1), (10, 1), (11, 1), (12, 1)],
         ...]
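To make the (token id, count) pairs in the corpus less opaque, here is a plain-Python sketch of the idea behind a token-to-id mapping and a bag-of-words conversion. It is not gensim's implementation: `build_token2id` and `to_bow` are hypothetical helpers, the ids they assign will generally differ from those gensim assigns, and the third document is made up to show a count above 1.

```python
from collections import Counter

def build_token2id(tokenized_docs):
    """Assign each new token the next free integer id, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert one tokenized document to sorted (token_id, count) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Hand-tokenized documents; the last one is a made-up example with a repeated token.
docs = [
    ["i", "really", "liked", "the", "movie", "!"],
    ["more", "space", "films", ",", "please", "!"],
    ["the", "movie", "the", "space"],
]

token2id = build_token2id(docs)
corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(corpus[2])
```

Each document becomes a sparse list: only tokens that actually occur in it are stored, each paired with its count.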

Let's practice!

Tf-idf with gensim

What is tf-idf?
- Term frequency - inverse document frequency
- Allows you to determine the most important words in each document
- Each corpus may have shared words beyond just stopwords
- These words should be down-weighted in importance
- Example from astronomy: "Sky"
- Ensures most common words don't show up as key words
- Keeps document-specific frequent words weighted high

Tf-idf formula

$$w_{i,j} = tf_{i,j} \cdot \log\left(\frac{N}{df_i}\right)$$

$w_{i,j}$ = tf-idf weight for token $i$ in document $j$

$tf_{i,j}$ = number of occurrences of token $i$ in document $j$

$df_i$ = number of documents that contain token $i$

$N$ = total number of documents
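The formula, weight = tf * log(N / df), can be tried out directly in plain Python. The sketch below assumes the natural logarithm and applies no normalization; the `tfidf_weights` helper and the three toy astronomy documents are illustrative. gensim's `TfidfModel` uses its own defaults (including normalization), so its numbers will not match this sketch exactly.

```python
import math
from collections import Counter

def tfidf_weights(tokenized_docs):
    """Return one {token: weight} dict per document, using w = tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # df: number of documents containing each token (each token counted once per document).
    df = Counter(token for doc in tokenized_docs for token in set(doc))
    weights = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # tf: occurrences of each token in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus: "sky" appears in every document, "star" only in the first.
docs = [
    ["sky", "star", "star"],
    ["sky", "planet"],
    ["sky", "galaxy"],
]

weights = tfidf_weights(docs)
print(weights[0])
```

Because "sky" occurs in all three documents, log(N / df) is log(1) = 0 and its weight vanishes, while "star" keeps a high weight in the only document that contains it.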

Tf-idf with gensim

In [10]: from gensim.models.tfidfmodel import TfidfModel

In [11]: tfidf = TfidfModel(corpus)

In [12]: tfidf[corpus[1]]
Out[12]: [(0, 0.1746298276735174),
          (1, 0.1746298276735174),
          (9, 0.29853166221463673),
          (10, 0.7716931521027908),
          ...]
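The tf-idf output is a list of (token id, weight) pairs, so finding the highest-weighted terms in a document is a matter of sorting. The sketch below copies only the four pairs visible in the output above (the rest is elided there, so this is not the full document); the variable names are illustrative.

```python
# (token_id, weight) pairs copied from the visible part of the tf-idf output;
# the remaining pairs are elided in the source and are not reproduced here.
doc_weights = [
    (0, 0.1746298276735174),
    (1, 0.1746298276735174),
    (9, 0.29853166221463673),
    (10, 0.7716931521027908),
]

# Sort by weight, highest first, and keep the top two entries.
top_terms = sorted(doc_weights, key=lambda pair: pair[1], reverse=True)[:2]

for term_id, weight in top_terms:
    print(term_id, round(weight, 3))
```

Turning the ids back into words requires the dictionary that was built from the tokenized documents.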

Let's practice!
