PLOTCON NYC: Text is data! Analysis and Visualization Methods
Transcript of PLOTCON NYC: Text is data! Analysis and Visualization Methods
![Page 1: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/1.jpg)
Text Analysis and Visualization
Irene Ros
@ireneros
http://bocoup.com/datavis
![Page 2: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/2.jpg)
![Page 3: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/3.jpg)
Bocoup Datavis Team
http://bocoup.com/datavis
Data Science & Visualization Design & Application Development
![Page 5: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/5.jpg)
WHY TEXT?
![Page 6: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/6.jpg)
http://www.nbcnews.com/politics/2016-election/trump-shocks-awes-final-new-hampshire-rally-primary-n514266
![Page 7: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/7.jpg)
TEXT IS DATA TOO
Document Collections
![Page 8: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/8.jpg)
SINGLE DOCUMENT
• Measurements
• Clean up
• Structure
• Word Relationships
![Page 9: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/9.jpg)
Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversations?'
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
There was nothing so very remarkable in that; nor did Alice think it so very much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually took a watch out of its waistcoat-pocket, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
https://www.gutenberg.org/files/11/11-h/11-h.htm
![Page 10: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/10.jpg)
MEASUREMENTS
BASIC COUNTS
basic units of text analysis...
![Page 11: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/11.jpg)
Word counts for the excerpt above (count: words):
13: the
11: it
9: to
8: of, her, a
7: she, and
5: was, rabbit
4: very, or, in, alice
3: with, when, that, so, out, had, but, at, '
2: watch, waistcoat, time, thought, sister, ran, pocket, pictures, on, nothing, mind, for, dear, conversations, by, book, be, as, across
1: would, worth, wondered, white, whether, what, well, way, use, up, under, twice, trouble, took, tired, this, think, there, then, take, suddenly, stupid, started, sleepy, sitting, shall, seen, seemed, see, say, remarkable, reading, quite, pop, pleasure, pink, picking, peeped, own, over, ought, once, oh, occurred, nor, no, never, natural, much, making, made, looked, late, large, just, itself, its, is, into, i, hurried, hot, hole, hedge, hear, having, have, getting, get, fortunately, flashed, field, feet, feel, eyes, either, down, do, did, day, daisy, daisies, curiosity, could, considering, close, chain, burning, beginning, before, bank, all, afterwards, after, actually, 'without, 'oh, 'and
![Page 13: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/13.jpg)
http://graphics.wsj.com/elections/2016/democratic-debate-charts/
![Page 14: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/14.jpg)
CODE TIME!
python + nltk
![Page 15: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/15.jpg)
    import nltk
    from collections import Counter

    # assumes `text` already holds the whole document as one string
    tokens = nltk.word_tokenize(text)
    counts = Counter(tokens)
    sorted_counts = sorted(counts.items(), key=lambda count: count[1], reverse=True)

    sorted_counts

[(',', 2418), ('the', 1516), ("'", 1129), ('.', 975), ('and', 757), ('to', 717), ('a', 612), ('it', 513), ('she', 507), ('of', 496), ('said', 456), ('!', 450), ('Alice', 394), ('I', 374),...]
![Page 16: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/16.jpg)
CLEAN-UP
slight diversion...
• Remove punctuation
• Remove stop words
• Normalize the case
• Remove fragments
• Stemming
![Page 17: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/17.jpg)
REMOVE PUNCTUATION

    # starting point for punctuation from python string
    # punctuation is !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    from string import punctuation

    # drop every token that appears in `unwanted`
    def remove_tokens(tokens, unwanted):
        return [token for token in tokens if token not in unwanted]

    no_punc_tokens = remove_tokens(tokens, punctuation)

before:
['CHAPTER', 'I', '.', 'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning', 'to',...]
after:
['CHAPTER', 'I', 'Down', 'the', 'Rabbit-Hole', 'Alice', 'was', 'beginning', 'to', 'get',...]
![Page 18: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/18.jpg)
NORMALIZE CASE

    # downcase every token
    def lowercase(tokens):
        return [token.lower() for token in tokens]

    lowercase(no_punc_tokens)

before:
['CHAPTER', 'I', 'Down', 'Rabbit-Hole', 'Alice', 'beginning', 'get', 'tired', 'sitting',...]
after:
['chapter', 'i', 'down', 'rabbit-hole', 'alice', 'beginning', 'get', 'tired', 'sitting',...]
![Page 19: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/19.jpg)
REMOVE STOP WORDS

    # import stopwords from nltk
    from nltk.corpus import stopwords
    stops = stopwords.words('english')

    # stop words look like:
    # [u'i', u'my', u'myself', u'we', u'our',
    #  u'ours', u'you'...]
    # lowercase first: the stop word list is all lowercase
    filtered_tokens = remove_tokens(lowercase(no_punc_tokens), stops)

before:
['chapter', 'i', 'down', 'rabbit-hole', 'alice', 'beginning', 'get', 'tired', 'sitting',...]
after:
['chapter', 'down', 'rabbit-hole', 'alice', 'beginning', 'get', 'tired', 'sitting', 'sister',...]
![Page 20: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/20.jpg)
REMOVE FRAGMENTS

    # Removes fragmented words like: n't, 's
    def remove_word_fragments(tokens):
        return [token for token in tokens if "'" not in token]

    no_frag_tokens = remove_word_fragments(filtered_tokens)

before:
['chapter', 'down', 'rabbit-hole', "n't", 'beginning', "'s", 'tired', 'sitting', 'sister',...]
after:
['chapter', 'down', 'rabbit-hole', 'beginning', 'tired', 'sitting', 'sister',...]
![Page 21: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/21.jpg)
STEMMING
Converts words to their 'base' form, for example:

    regular = ['house', 'housing', 'housed']
    stemmed = ['hous', 'hous', 'hous']

    from nltk.stem import PorterStemmer
    stemmer = PorterStemmer()

    stemmed_tokens = [stemmer.stem(token) for token in no_frag_tokens]

before:
['chapter', 'rabbit-hole', 'alice', 'beginning', 'get', 'tired', 'sitting', 'sister', 'bank', 'nothing',...]
after:
['chapter', 'rabbit-hol', 'alic', 'begin', 'get', 'tire', 'sit', 'sister', 'bank', 'noth',...]
![Page 22: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/22.jpg)
IMPROVED COUNTS:
before:
[(',', 2418), ('the', 1516), ("'", 1129), ('.', 975), ('and', 757), ('to', 717), ('a', 612), ('it', 513), ('she', 507), ('of', 496), ('said', 456), ('!', 450), ('Alice', 394), ('I', 374), ('was', 362), ('in', 351), ('you', 337), ('that', 267), ('--', 264),...]
after:
[('said', 462), ('alic', 396), ('littl', 128), ('look', 102), ('one', 100), ('like', 97), ('know', 91), ('would', 90), ('could', 86), ('went', 83), ('thought', 80), ('thing', 79), ('queen', 76), ('go', 75), ('time', 74), ('say', 70), ('see', 68), ('get', 66), ('king', 64),...]
![Page 23: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/23.jpg)
STRUCTURE
PART-OF-SPEECH TAGGING
back to our regular format...
![Page 24: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/24.jpg)
PART OF SPEECH TAGGING
http://cogcomp.cs.illinois.edu/page/demo_view/pos
![Page 25: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/25.jpg)
http://cogcomp.cs.illinois.edu/page/demo_view/pos
![Page 26: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/26.jpg)
http://tvtropes.org/pmwiki/pmwiki.php/Main/DamselInDistress
![Page 27: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/27.jpg)
![Page 28: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/28.jpg)
http://stereotropes.bocoup.com/tropes/DamselInDistress
![Page 29: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/29.jpg)
http://stereotropes.bocoup.com/gender
![Page 30: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/30.jpg)
POS-TAGGING
POS-tag your raw tokens: punctuation and capitalization matter
tagged = nltk.pos_tag(tokens)
[('CHAPTER', 'NN'), ('I', 'PRP'), ('.', '.'), ('Down', 'RP'), ('the', 'DT'), ('Rabbit-Hole', 'JJ'), ('Alice', 'NNP'), ('was', 'VBD'), ('beginning', 'VBG'), ('to', 'TO'), ('get', 'VB'), ('very', 'RB'), ('tired', 'JJ'), ('of', 'IN'), ('sitting', 'VBG'), ('by', 'IN'), ('her', 'PRP$'), ('sister', 'NN'),...]
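Once tokens are tagged, a common next step is to pull out the words of one grammatical category (for example, adjectives). The sketch below is only an illustration: it works on a small list of (token, tag) pairs copied from the output above, and the helper names `words_by_tag` and `words_with_prefix` are made up for this example, not part of NLTK.

```python
from collections import defaultdict

# Sample (token, tag) pairs copied from the pos_tag output above.
tagged = [
    ('CHAPTER', 'NN'), ('I', 'PRP'), ('.', '.'), ('Down', 'RP'),
    ('the', 'DT'), ('Rabbit-Hole', 'JJ'), ('Alice', 'NNP'),
    ('was', 'VBD'), ('beginning', 'VBG'), ('to', 'TO'), ('get', 'VB'),
    ('very', 'RB'), ('tired', 'JJ'), ('of', 'IN'), ('sitting', 'VBG'),
    ('by', 'IN'), ('her', 'PRP$'), ('sister', 'NN'),
]


def words_by_tag(tagged_tokens):
    """Group lowercased words under their part-of-speech tag."""
    groups = defaultdict(list)
    for word, tag in tagged_tokens:
        groups[tag].append(word.lower())
    return dict(groups)


def words_with_prefix(tagged_tokens, prefix):
    """Return words whose tag starts with `prefix` (e.g. 'JJ', 'VB')."""
    return [word.lower() for word, tag in tagged_tokens
            if tag.startswith(prefix)]


groups = words_by_tag(tagged)
adjectives = words_with_prefix(tagged, 'JJ')
verbs = words_with_prefix(tagged, 'VB')
print(adjectives)
print(verbs)
```

Matching on a tag prefix keeps related tags together, so 'VB' picks up VBD and VBG as well as VB.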
![Page 31: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/31.jpg)
WORD RELATIONSHIPS
CONCORDANCE, N-GRAMS, CO-OCCURRENCE
![Page 32: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/32.jpg)
CONCORDANCE
KEYWORD IN CONTEXT

    my_text = nltk.Text(tokens)
    my_text.concordance('Alice')

                                         Alice was beginning to get very tired of s
    hat is the use of a book , ' thought Alice 'without pictures or conversations ?
    so VERY remarkable in that ; nor did Alice think it so VERY much out of the way
    looked at it , and then hurried on , Alice started to her feet , for it flashed
     hedge . In another moment down went Alice after it , never once considering ho
    ped suddenly down , so suddenly that Alice had not a moment to think about stop
    she fell past it . 'Well ! ' thought Alice to herself , 'after such a fall as t
    own , I think -- ' ( for , you see , Alice had learnt several things of this so
    tude or Longitude I 've got to ? ' ( Alice had no idea what Latitude was , or L
     . There was nothing else to do , so Alice soon began talking again . 'Dinah 'l
    ats eat bats , I wonder ? ' And here Alice began to get rather sleepy , and wen
    dry leaves , and the fall was over . Alice was not a bit hurt , and she jumped
     not a moment to be lost : away went Alice like the wind , and was just in time
     but they were all locked ; and when Alice had been all the way down one side a
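To make the idea behind a concordance concrete, here is a minimal keyword-in-context sketch in plain Python. It is not how NLTK implements `concordance` (NLTK cuts context by character width, this version by a number of tokens); the function names, the `window` parameter and the sample sentence are assumptions for illustration.

```python
def keyword_in_context(tokens, keyword, window=4):
    """Return (left, keyword, right) context triples for each match."""
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows


def format_kwic(rows):
    """Right-align the left contexts so the keyword forms one column."""
    width = max((len(left) for left, _, _ in rows), default=0)
    return [f"{left.rjust(width)} {kw} {right}" for left, kw, right in rows]


# Hypothetical pre-tokenized sample, loosely based on the excerpt.
sample = ("Alice was beginning to get very tired of sitting by her sister "
          "on the bank , and what is the use of a book , thought Alice "
          "without pictures or conversations ?").split()

rows = keyword_in_context(sample, 'Alice', window=3)
for line in format_kwic(rows):
    print(line)
```

Aligning the keyword in a single column is what lets the eye scan the words that come right before and right after it.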
![Page 33: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/33.jpg)
CONCORDANCE
![Page 34: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/34.jpg)
https://www.washingtonpost.com/graphics/politics/2016-election/debates/oct-13-speakers/
CONCORDANCE
![Page 35: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/35.jpg)
CONCORDANCE: WORD TREE
https://www.jasondavies.com/wordtree/?source=alice-in-wonderland.txt&prefix=She
![Page 36: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/36.jpg)
http://www.chrisharrison.net/index.php/Visualizations/WordSpectrum
Visualizing Google's Bi-Gram Data
N-GRAMS (COLLOCATIONS)
![Page 37: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/37.jpg)
http://www.chrisharrison.net/index.php/Visualizations/WordAssociations
N-GRAMS (COLLOCATIONS)
![Page 38: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/38.jpg)
N-GRAMS (COLLOCATIONS)
A set of words that occur together more often than chance.

    from nltk.collocations import BigramCollocationFinder
    finder = BigramCollocationFinder.from_words(filtered_tokens)

    # built in bigram metrics are in here
    bigram_measures = nltk.collocations.BigramAssocMeasures()

    # we call score_ngrams on the finder to produce a sorted list
    # of bigrams. Each comes with its score from the metric, which
    # is how they are sorted.
    finder.score_ngrams(bigram_measures.raw_freq)
[(('said', 'the'), 0.007674512539933169), (('of', 'the'), 0.004700179928762898), (('said', 'alice'), 0.004259538060441376), (('in', 'a'), 0.0035618551022656335), (('and', 'the'), 0.002900892299783351),...]
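For intuition about what a raw-frequency score measures, the sketch below counts adjacent word pairs with the standard library and divides each count by the total number of tokens. That matches my understanding of NLTK's `raw_freq`, but treat it as an approximation, not a drop-in replacement; `bigram_raw_freq` and the toy token list are invented for this example.

```python
from collections import Counter


def bigram_raw_freq(tokens):
    """Score each adjacent word pair by count / total number of tokens."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = [(pair, count / total) for pair, count in pair_counts.items()]
    # highest score first, ties broken alphabetically for stable output
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical toy input; real input would be the cleaned token list.
sample_tokens = "said the queen said the king said alice".split()
scores = bigram_raw_freq(sample_tokens)
for pair, score in scores[:3]:
    print(pair, score)
```

Raw frequency favours pairs built from very common words, which is why a statistical measure such as the likelihood ratio often gives more interesting collocations.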
![Page 39: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/39.jpg)
N-GRAMS (COLLOCATIONS)
A set of words that occur together more often than chance.
finder.score_ngrams(bigram_measures.likelihood_ratio)
[(('mock', 'turtle'), 781.0917141765854), (('said', 'the'), 597.9581706687363), (('said', 'alice'), 505.46971076855675), (('march', 'hare'), 461.91931122768904), (('went', 'on'), 376.6417465508724), (('do', "n't"), 372.7029564560615), (('the', 'queen'), 351.39319634691446), (('the', 'king'), 342.27277302768084), (('in', 'a'), 341.4084817025905), (('the', 'gryphon'), 278.40108569878106),...]
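The likelihood-ratio score compares how often a pair actually occurs with how often it would occur if the two words were independent. Below is a hedged sketch of the standard log-likelihood ratio (G²) over a 2x2 contingency table. The marginals here are counted from bigram positions, which differs slightly from NLTK's bookkeeping, so the numbers will not exactly reproduce the output above; all function names are made up for this example.

```python
import math
from collections import Counter


def log_likelihood_ratio(n_ii, n_ix, n_xi, n_xx):
    """G^2 statistic for a bigram (w1, w2).

    n_ii: count of the bigram itself
    n_ix: count of bigrams whose first word is w1
    n_xi: count of bigrams whose second word is w2
    n_xx: total number of bigrams
    """
    observed = [n_ii, n_ix - n_ii, n_xi - n_ii, n_xx - n_ix - n_xi + n_ii]
    row_totals = [n_ix, n_ix, n_xx - n_ix, n_xx - n_ix]
    col_totals = [n_xi, n_xx - n_xi, n_xi, n_xx - n_xi]
    score = 0.0
    for obs, row, col in zip(observed, row_totals, col_totals):
        if obs > 0:  # a zero cell contributes nothing to the sum
            expected = row * col / n_xx
            score += obs * math.log(obs / expected)
    return 2 * score


def score_bigrams(tokens):
    """Return (bigram, G^2) pairs sorted from most to least associated."""
    bigrams = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(bigrams)
    first_counts = Counter(w1 for w1, _ in bigrams)
    second_counts = Counter(w2 for _, w2 in bigrams)
    total = len(bigrams)
    scored = [
        (pair, log_likelihood_ratio(count, first_counts[pair[0]],
                                    second_counts[pair[1]], total))
        for pair, count in pair_counts.items()
    ]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical toy input.
toy = "the mock turtle said the mock turtle went on the march hare said".split()
ranked = score_bigrams(toy)
for pair, score in ranked[:3]:
    print(pair, round(score, 3))
```

A pair whose words are independent scores near zero, while a pair that always appears together scores high even when it is rare.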
![Page 40: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/40.jpg)
Phrase Net: X begat Y
CO-OCCURRENCE
Frank Van Ham, http://ieeexplore.ieee.org/ieee_pilot/articles/06/ttg2009061169/article.html#article
![Page 41: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/41.jpg)
Phrase Net: X of Y
old testament new testament
![Page 42: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/42.jpg)
CO-OCCURRENCE

    my_text = nltk.Text(tokens)
    my_text.findall('<.*><of><.*>')

tired of sitting; and of having; use of a; pleasure of making; trouble of getting; out of the; out of it; plenty of time; sides of the; one of the; fear of killing; one of the; nothing of tumbling; top of the; centre of the; things of this; name of the; saucer of milk; sort of way; heap of sticks; row of lamps; made of solid; one of the; doors of the; any of them; out of that; beds of bright; be of very; book of rules; neck of the; sort of mixed; flavour of cherry-tart; flame of a; one of the; legs of the; game of croquet; fond of pretending; enough of me; top of her; way of expecting; Pool of Tears; out of sight; pair of boots; roof of the; ashamed of yourself; gallons of tears; pattering of feet; pair of white; help of any; were of the; any of them; sorts of things; capital of Paris; capital of Rome; waters of the; burst of tears; tired of being; one of the; cause of this; number of bathing; row of lodging; pool of tears; be of any; out of this; tired of swimming; way of speaking; -- of a; one of its; knowledge of history; out of the; end of his; subject of conversation; -- of --;
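The same "X of Y" pattern can be sketched without NLTK, which also makes it easy to count how often each (X, Y) pair occurs, the kind of edge weight a Phrase Net style diagram needs. This is an illustrative sketch only; `find_pattern`, `edge_counts` and the sample sentence are assumptions, not part of the talk's code.

```python
from collections import Counter


def find_pattern(tokens, middle):
    """Return (left, middle, right) triples for every 'X <middle> Y'."""
    matches = []
    for i in range(1, len(tokens) - 1):
        if tokens[i].lower() == middle:
            matches.append((tokens[i - 1], tokens[i], tokens[i + 1]))
    return matches


def edge_counts(matches):
    """Count each lowercased (X, Y) pair, ignoring the linking word."""
    return Counter((left.lower(), right.lower()) for left, _, right in matches)


# Hypothetical pre-tokenized sample, loosely based on the excerpt.
sample = ("she was tired of sitting and of having nothing to do , "
          "what is the use of a book").split()

matches = find_pattern(sample, 'of')
edges = edge_counts(matches)
print('; '.join(' '.join(match) for match in matches))
```

Swapping the `middle` word (for example "and" or "begat") yields a different relationship network from the same text.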
![Page 43: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/43.jpg)
COLLECTIONS OF DOCUMENTS
• Grouping/Clustering
• Comparison
![Page 44: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/44.jpg)
SIGNIFICANCE
TF-IDF: TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY
![Page 45: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/45.jpg)
The occurrence of "cat" in an article in the New York Times: Significant
The occurrence of "cat" in an article in Cat Weekly Magazine: Not Significant
![Page 46: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/46.jpg)
TF

$$\text{Term Frequency (TF)} = \frac{\text{Number of times a term appears in a document}}{\text{Total number of terms in a document}}$$

Example: 1 document with 100 words, 3 of them "cat": 3 / 100 = 0.03
![Page 47: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/47.jpg)
IDF

$$\text{Inverse Document Frequency (IDF)} = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with term } t \text{ in it}}\right)$$

Example: 10 million documents, 1000 containing "cat": log(10,000,000 / 1000) = 4 (base-10 logarithm)
![Page 48: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/48.jpg)
A high weight in TF*IDF is reached by a high term frequency (in a given document) and a low number of documents that contain that term. As the term appears in more documents, the ratio inside the logarithm approaches 1, bringing the IDF and TF-IDF closer to zero.

Examples, assuming each document that contains the word "cat" has it 3 times and has a total of 100 words (ln is the natural logarithm):
• 10,000 documents, 1 document containing "cat": (3 / 100) * ln(10,000 / 1) = 0.2763102111592855
• 10,000 documents, 100 documents containing "cat": (3 / 100) * ln(10,000 / 100) = 0.13815510557964275
• 10,000 documents, all 10,000 documents containing "cat": (3 / 100) * ln(10,000 / 10,000) = 0.0
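The arithmetic on the TF, IDF and TF*IDF slides can be reproduced in a few lines. This is a minimal sketch under the slides' own assumptions (3 occurrences of "cat" in a 100-word document); the function names and the `base` parameter are illustrative choices, and real libraries often use smoothed variants of IDF.

```python
import math


def term_frequency(term_count, total_terms):
    """Share of a document's terms taken up by one term."""
    return term_count / total_terms


def inverse_document_frequency(total_docs, docs_with_term, base=math.e):
    """Log of (all documents / documents containing the term)."""
    return math.log(total_docs / docs_with_term, base)


def tf_idf(term_count, total_terms, total_docs, docs_with_term, base=math.e):
    """Term frequency weighted by inverse document frequency."""
    tf = term_frequency(term_count, total_terms)
    return tf * inverse_document_frequency(total_docs, docs_with_term, base)


# The IDF slide uses a base-10 log; the TF*IDF examples use ln.
idf_base10 = inverse_document_frequency(10_000_000, 1000, base=10)
weights = {docs: tf_idf(3, 100, 10_000, docs) for docs in (1, 100, 10_000)}

print(idf_base10)
for docs, weight in weights.items():
    print(docs, weight)
```

The choice of logarithm base only rescales every weight by the same constant, so it does not change how terms rank against each other.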
![Page 49: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/49.jpg)
Themail
Visualizing Email Content: Portraying Relationships from Conversational Histories
Fernanda B. Viégas, 2006
![Page 50: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/50.jpg)
GROUPING
CLASSIFICATION, CLUSTERING
![Page 51: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/51.jpg)
News Classifier
Categories: World, Sports, Entertainment, Life, Arts
![Page 52: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/52.jpg)
News Classifier
New article, without a subject assigned yet
World
![Page 53: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/53.jpg)
Many Bills: Engaging Citizens through Visualizations of Congressional Legislation
Yannick Assogba, Irene Ros, Joan DiMicco, Matt McKeon
IBM Research
http://clome.info/papers/manybills_chi.pdf
![Page 54: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/54.jpg)
![Page 55: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/55.jpg)
COMPARISON
COSINE SIMILARITY, CLUSTERING
![Page 56: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/56.jpg)
Mike Bostock, Sean Carter, http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html
![Page 57: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/57.jpg)
Document collection (words per document, counts in parentheses):
• cat (5), house, dog, monkey
• air, strike (2), tanker, machine
• flight (2), strike, machine, guns
• light, air (4), balloon, flight
• cat, scratch, blood (2), hospital
• flight (4), commercial, aviation
![Page 58: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/58.jpg)
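Vocabulary: cat, house, dog, monkey, air, strike, tanker, machine, flight, guns, light, balloon, scratch, blood, hospital, commercial, aviation

One count vector per document, with one entry per vocabulary word:
[5,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0]
[0,0,0,0,1,2,1,1,0,0,0,0,0,0,0,0,0]
[0,0,0,0,0,1,0,1,2,1,0,0,0,0,0,0,0]
...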
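The step from documents to count vectors, and from vectors to a similarity score, can be sketched as follows. The word counts are taken from the document collection slide; the function names are assumptions for this example, and a real project would more likely use a library vectorizer than hand-rolled lists.

```python
import math

# Word counts per document, taken from the document collection slide.
documents = [
    {'cat': 5, 'house': 1, 'dog': 1, 'monkey': 1},
    {'air': 1, 'strike': 2, 'tanker': 1, 'machine': 1},
    {'flight': 2, 'strike': 1, 'machine': 1, 'guns': 1},
    {'light': 1, 'air': 4, 'balloon': 1, 'flight': 1},
    {'cat': 1, 'scratch': 1, 'blood': 2, 'hospital': 1},
    {'flight': 4, 'commercial': 1, 'aviation': 1},
]


def build_vocabulary(docs):
    """Collect every distinct term, in order of first appearance."""
    vocab = []
    for doc in docs:
        for term in doc:
            if term not in vocab:
                vocab.append(term)
    return vocab


def to_vector(doc, vocab):
    """One count per vocabulary term, 0 when the term is absent."""
    return [doc.get(term, 0) for term in vocab]


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vocab = build_vocabulary(documents)
vectors = [to_vector(doc, vocab) for doc in documents]

print(vectors[0])
print(cosine_similarity(vectors[0], vectors[1]))  # no shared words
print(cosine_similarity(vectors[1], vectors[2]))  # share strike, machine
```

Cosine similarity depends on the direction of the vectors, not their length, so a long and a short document about the same topic can still score as similar. The same vectors are the usual input to a clustering method such as k-means.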
![Page 59: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/59.jpg)
K-MEANS CLUSTERING
![Page 60: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/60.jpg)
TOOLS
• Textkit
http://learntextvis.github.io/textkit/
![Page 61: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/61.jpg)
TOOLS, NO PROGRAMMING
• AntConc (http://www.laurenceanthony.net/software.html)
• Overview Project (https://www.overviewdocs.com/)
• Voyant (http://voyant-tools.org/)
• Lexos (http://lexos.wheatoncollege.edu/upload)
• Word and Phrase (http://www.wordandphrase.info/)
• CorpKit (http://interrogator.github.io/corpkit/index.html)

Tool collections:
• DiRT tools (http://dirtdirectory.org/)
• TAPoR (http://tapor.ca/home)
• http://guides.library.duke.edu/c.php?g=289707&p=1930856
Many of these come from a great talk by Lynn Cherny: http://ghostweather.slides.com/lynncherny/text-data-analysis-without-programming
![Page 62: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/62.jpg)
NOT COVERED, BUT NOTEWORTHY
• Topic Modeling
• Sentiment Analysis
• Entity Extraction
• Word2Vec
• Neural networks
• Search
• Historic Trends
![Page 63: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/63.jpg)
GO VISUALIZE SOME WORDS
![Page 65: PLOTCON NYC: Text is data! Analysis and Visualization Methods](https://reader036.fdocuments.us/reader036/viewer/2022062522/589c80b61a28abc2258b6291/html5/thumbnails/65.jpg)
CITATION
Icon created by Piola, Noun Project: https://thenounproject.com/search/?q=document&i=709260