Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)
![Page 1: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/1.jpg)
Bestseller Analysis: Visualizing Fiction
Lynn Cherny @arnicas
PyData Boston 2013
![Page 2: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/2.jpg)
Language, Sex, Violence (also spoilers)
TEXT
![Page 3: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/3.jpg)
Today’s Books
![Page 4: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/4.jpg)
BASED ON A PREVIOUS TALK:
http://blogger.ghostweather.com/2013/06/analysis-of-fiction-my-openvisconf-talk.html
THE VIDEO OF THAT TALK:
http://www.youtube.com/watch?v=f41U936WqPM
This talk goes into more technical detail and spends more time on topic analysis. The IPython notebook of code samples for this talk lives here: http://ghostweather.com/essays/talks/openvisconf/Pydata_Code.ipynb
![Page 5: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/5.jpg)
http://www.economist.com/blogs/graphicdetail/2012/11/fifty-shades-data-visualisations
![Page 6: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/6.jpg)
Text Classification (Commonly)
§ “Bag of words” – each document is considered a collection of words, independent of order
§ Frequencies of certain words are used to identify the texts
Seems like this should work with sex scenes, right? Only so many body parts and behaviors, right?!
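To make the “bag of words” idea concrete, here is a minimal sketch using only the Python standard library. The helper name `bag_of_words` and the two sample sentences are made up for illustration; the point is just that word order is thrown away and only the counts remain.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, pull out word tokens, and count them (order is discarded)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Two invented sentences with the same words in a different order.
doc_a = "He waves me toward the couch."
doc_b = "Toward the couch he waves me."

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

# Same words, different order: the two bags are identical.
same_bag = bag_a == bag_b
print(same_bag)
```

A classifier built on these counts only sees which words occur and how often, which is why word choice matters so much more than phrasing.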
![Page 7: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/7.jpg)
| Data | Label |
| --- | --- |
| Estdsgfd fdsatreatret dfds | Yes |
| Dsrdsf drerear ewrewtrew | No |
| Reret retdrtd rewrewrtew | Yes |
| Dsfgdg fdsfd | Yes |
Algorithm
Train
Test
New data in the wild
![Page 8: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/8.jpg)
Sex Scene Detection First Steps
1. Buy 50 Shades on Amazon, unlock text in Calibre, save as TXT file.
2. Cut the document into chunks of 500 “words” each using Python
![Page 9: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/9.jpg)
Cutting up the document
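A minimal sketch of the chunking step, assuming a “word” is simply anything separated by whitespace (the original notebook may tokenize differently). The function name `chunk_words` and the generated sample text are placeholders.

```python
def chunk_words(text, chunk_size=500):
    """Split text on whitespace and regroup it into chunks of chunk_size words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# Placeholder text of 1200 "words"; a real run would read the saved TXT file.
sample = " ".join("word%d" % i for i in range(1200))

chunks = chunk_words(sample)
chunk_lengths = [len(chunk.split()) for chunk in chunks]
print(chunk_lengths)
```

The last chunk is usually shorter than the rest, so it may be worth deciding up front whether to keep it or drop it before sending chunks out for labeling.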
![Page 10: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/10.jpg)
“Would you like to sit?” He waves me toward an L-shaped white leather couch. His office is way too big for just one man. In front of the floor-to-ceiling windows, there’s a modern dark wood desk that six people could comfortably eat around. It matches the coffee table by the couch. Everything else is white—ceiling, floors, and walls, except for the wall by the door, where a mosaic of small paintings hang, thirty-six of them arranged in a square. They are exquisite—a series of mundane, forgotten objects painted in such precise detail they look like photographs. Displayed together, they are breathtaking. “A local artist. Trouton,” says Grey when he catches my gaze. “They’re lovely. Raising the ordinary to extraordinary,” I murmur, distracted both by him and the paintings. He cocks his head to one side and regards me intently. “I couldn’t agree more, Miss Steele,” he replies, his voice soft, and for some inexplicable reason I find myself blushing.
Sample of 50 Shades of Grey
![Page 11: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/11.jpg)
Manual labeling suckage
http://www.deargrumpycat.com/
![Page 12: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/12.jpg)
Outsourced to Mechanical Turk
![Page 13: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/13.jpg)
WHAT’S A SEX SCENE, ANYWAY?
![Page 14: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/14.jpg)
Zara.com
![Page 15: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/15.jpg)
http://www.ebay.com/itm/Adult-Sex-Toys-Tools-Handcuffs-Eye-mask-Neck-Band-Strap-Whip-Rope-/330845727274?pt=UK_Home_Garden_Celebrations_Occasions_ET&hash=item4d07f12a2a
![Page 16: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/16.jpg)
Sexually Exxxplicit, but still a
http://www.icts.uiowa.edu/sites/default/files/contract.jpg
![Page 17: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/17.jpg)
![Page 18: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/18.jpg)
How’d the raters do?
Sex Scenes
Steamy Scenes
![Page 19: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/19.jpg)
Comparing to “Pornographic”…
![Page 20: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/20.jpg)
Comparing:
![Page 21: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/21.jpg)
On to the learning algorithm…
So, the training data:
- The text chunks
- The score the raters gave each chunk (averaged) as “truth”
I started with Python’s NLTK (Natural Language Toolkit) and Naïve Bayes for classifying (working in an IPython notebook).
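As a rough sketch of how averaged rater scores could become training labels: the rating scale (1 = sex scene, 0 = not), the chunk IDs, and the 0.5 cutoff below are all assumptions for illustration, not the values used in the talk.

```python
from collections import defaultdict

# Hypothetical ratings: (chunk_id, rater_score), where 1 = "sex scene" and 0 = "not".
ratings = [
    ("chunk_001", 0), ("chunk_001", 0), ("chunk_001", 1),
    ("chunk_002", 1), ("chunk_002", 1), ("chunk_002", 1),
    ("chunk_003", 0), ("chunk_003", 1),
]

def average_scores(rows):
    """Average every rater's score for each chunk."""
    scores = defaultdict(list)
    for chunk_id, score in rows:
        scores[chunk_id].append(score)
    return {cid: sum(vals) / len(vals) for cid, vals in scores.items()}

def to_labels(averages, threshold=0.5):
    """Turn averaged scores into 'pos'/'neg' labels (the threshold is an assumption)."""
    return {cid: ("pos" if avg >= threshold else "neg") for cid, avg in averages.items()}

averages = average_scores(ratings)
labels = to_labels(averages)
print(labels)
```

Where the cutoff sits changes how many “pos” examples the classifier gets to learn from, so it is worth trying a few values.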
![Page 22: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/22.jpg)
Resources on NLTK Naïve Bayes
§ The NLTK book chapter: http://nltk.googlecode.com/svn/trunk/doc/book/ch06.html
§ Jacob Perkins’ example of sentiment analysis with NLTK: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
![Page 23: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/23.jpg)
Perkins’ NLTK code for this…

    import nltk.classify.util
    from nltk.classify import NaiveBayesClassifier
    from nltk.corpus import movie_reviews

    # Bag-of-words features: every word in the document maps to True.
    def word_feats(words):
        return dict([(word, True) for word in words])

    negids = movie_reviews.fileids('neg')
    posids = movie_reviews.fileids('pos')

    negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
    posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

    # Use the first 3/4 of each class for training and the rest for testing.
    negcutoff = len(negfeats) * 3 // 4
    poscutoff = len(posfeats) * 3 // 4

    trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
    testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
    print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

    classifier = NaiveBayesClassifier.train(trainfeats)
    print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
    classifier.show_most_informative_features()
![Page 24: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/24.jpg)
His movie sentiment output
72% accuracy, trained on 1500 inputs.
![Page 25: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/25.jpg)
My results on 50 Shades sex scenes
82% accuracy!
![Page 26: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/26.jpg)
Previously with less “pos” data: not so great at 68%
“packet” (they use a lot of condoms)
![Page 27: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/27.jpg)
Python’s sklearn (scikit-learn)
Lots of classifiers for sparse data like text!
http://scikit-learn.org/0.13/auto_examples/document_classification_20newsgroups.html
![Page 28: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/28.jpg)
Using a lemmatizer step in the pipeline (to strip endings off words, since some fiction in my later samples was in present tense)
Pipelines in sklearn make it incredibly easy to run lots of experiments.
Fit the model, using training data and “target” answers (in this case, “50 Shades of Grey”)
Test the model on new data (in this case, “50 Shades Darker”). Check how it did against the answers.
Now we’re at 88%
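A hedged sketch of what such a pipeline might look like. The TF-IDF vectorizer, the default SGD settings, the toy texts, and the `crude_tokens` helper (a stand-in for a real lemmatizer) are all assumptions for illustration; the notebook linked earlier has the actual setup.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

def crude_tokens(text):
    """Stand-in for a real lemmatizer: lowercase and strip a few common endings."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [re.sub(r"(ing|ed|es|s)$", "", t) if len(t) > 4 else t for t in tokens]

pipeline = Pipeline([
    ("vectorize", TfidfVectorizer(tokenizer=crude_tokens, token_pattern=None)),
    ("classify", SGDClassifier(random_state=0)),
])

# Invented placeholder data; real inputs would be the labeled 500-word chunks.
train_texts = [
    "he kissed her slowly in the bedroom",
    "she kisses him and pulls him toward the bed",
    "the meeting about the quarterly budget ran long",
    "he drove to the office and parked the car",
]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts = ["they kissed in the bedroom", "the budget meeting at the office"]
test_labels = ["pos", "neg"]

# Fit on one labeled set, then check predictions against held-out answers.
pipeline.fit(train_texts, train_labels)
predicted = pipeline.predict(test_texts)
accuracy = pipeline.score(test_texts, test_labels)
print(predicted, accuracy)
```

Swapping the vectorizer or the classifier is a one-line change to the step list, which is what makes running many experiments cheap.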
![Page 29: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/29.jpg)
Interpreting the results… Let’s make a tool!
Demo: http://www.ghostweather.com/essays/talks/openvisconf/text_scores/rollover.html
![Page 30: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/30.jpg)
Really amazing P.S. here…
I also paid to have a bunch of fan fiction coded for sex scenes, and fed it into the sklearn SGD classifier.
(Note that 50 Shades started life as Twilight fanfic.)
97% accuracy achieved!*
*Cross-validating with the entire set, not just the 50 Shades books.
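For readers new to cross-validation, here is a minimal standard-library sketch of how k-fold splitting works: every item lands in the test set exactly once, and the scores from each fold are averaged. The function name `k_fold_indices` and the choice of 3 folds are assumptions, not details from the talk, and real data should be shuffled first.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) so that every item is tested exactly once."""
    # Spread any remainder over the first few folds.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_items) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(10, k=3))
test_sizes = [len(test) for _, test in folds]
print(test_sizes)
```

Each fold's train indices would be used to fit the classifier and its test indices to score it.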
![Page 31: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/31.jpg)
TOPIC ANALYSIS
Moving on to Dan Brown!
![Page 32: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/32.jpg)
Almost naked, Silas hurled his pale body down the staircase. He knew he had been betrayed, but by whom? When he reached the foyer, more officers were surging through the front door. Silas turned the other way and dashed deeper into the residence hall. The women's entrance. Every Opus Dei building has one. Winding down narrow hallways, Silas snaked through a kitchen, past terrified workers, who left to avoid the naked albino as he knocked over bowls and silverware, bursting into a dark hallway near the boiler room. He now saw the door he sought, an exit light gleaming at the end. Running full speed through the door out into the rain, Silas leapt off the low landing, not seeing the officer coming the other way until it was too late. The two men collided, Silas's broad, naked shoulder grinding into the man's sternum with crushing force. He drove the officer backward onto the pavement, landing hard on top of him. The officer's gun clattered away. Silas could hear men running down the hall shouting. Rolling, he grabbed the loose gun just as the officers emerged. A shot rang out on the stairs, and Silas felt a searing pain below his ribs. Filled with rage, he opened fire at all three officers, their blood spraying. A dark shadow loomed behind, coming out of nowhere. The angry hands that grabbed at his bare shoulders felt as if they were infused with the power of the devil himself. The man roared in his ear. SILAS, NO! Silas spun and fired. Their eyes met. Silas was already screaming in horror as Bishop Aringarosa fell.
Chapter 96 DaVinci Code
![Page 33: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/33.jpg)
Blei (2011)
![Page 34: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/34.jpg)
Resources for Topic Analysis
§ David Mimno’s Java Mallet is “the one everyone uses”:
- http://mallet.cs.umass.edu/index.php
- The R mallet package is rather nice, too: http://www.cs.princeton.edu/~mimno/R/
- This is a GUI wrapper for Mallet that outputs nice CSV and HTML pages: https://code.google.com/p/topic-modeling-tool/
§ Some pure Python (and C) implementations (toy code, primarily) are listed on Blei’s website: http://www.cs.princeton.edu/~blei/topicmodeling.html
![Page 35: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/35.jpg)
Topic Modeling Tool (GUI)
![Page 36: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/36.jpg)
Post run…
![Page 37: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/37.jpg)
Pros/Cons vs CMD-Line Mallet
Pros
§ Lets you specify a stopword file
§ Produces CSV and HTML output in a neat dir structure
§ Has a GUI (simpler to just get going)
Cons
§ Runs with defaults, so no optimize-interval or other cmd line options
§ No diagnostic output (a command-line option)
§ Not super-well doc’d
Tutorial on cmd line usage: http://programminghistorian.org/lessons/topic-modeling-and-mallet
![Page 38: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/38.jpg)
2 of the 3 CSV Output files
![Page 39: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/39.jpg)
Notice a horrible thing here:
![Page 40: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/40.jpg)
My notebook has lots of code to process these files…
![Page 41: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/41.jpg)
A few pandas stats…
107 chapters, 10 topics “requested”…
Topic proportion distribution…
![Page 42: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/42.jpg)
The default HTML output is a little lacking…
A bipartite graph of chapters and topics is an obvious vis method….
![Page 43: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/43.jpg)
Network JSON in D3.js
![Page 44: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/44.jpg)
Making the objects: Make objects of nodes, links, and any extra data values on each that you want…
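A minimal sketch of building that node/link structure in Python before handing it to D3. The chapter names, topic proportions, the `min_weight` cutoff, and the field names other than `nodes`, `links`, `source`, and `target` are hypothetical; links refer to nodes by index, which D3's force layout accepts.

```python
import json

# Hypothetical topic proportions per chapter (not real model output).
doc_topics = {
    "Chapter 1": {"topic_0": 0.70, "topic_1": 0.05},
    "Chapter 2": {"topic_0": 0.10, "topic_1": 0.60},
}

def build_network(doc_topics, min_weight=0.1):
    """Build a bipartite chapter/topic graph as {"nodes": [...], "links": [...]}."""
    nodes, index, links = [], {}, []

    def node_id(name, kind):
        # Register each node once and return its position in the nodes list.
        if name not in index:
            index[name] = len(nodes)
            nodes.append({"name": name, "type": kind})
        return index[name]

    for chapter, topics in doc_topics.items():
        source = node_id(chapter, "chapter")
        for topic, weight in topics.items():
            if weight >= min_weight:  # skip weak chapter/topic connections
                target = node_id(topic, "topic")
                links.append({"source": source, "target": target, "value": weight})
    return {"nodes": nodes, "links": links}

network = build_network(doc_topics)
network_json = json.dumps(network, indent=2)
print(network_json)
```

Extra values on each node or link (an excitement score, say) can be added as more dictionary keys and read back in D3 for color or stroke width.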
![Page 45: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/45.jpg)
Let’s try a hairball!
![Page 46: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/46.jpg)
Improving the network’s UI…
Adding strength, highlight effect, another variable, and informative tooltips.
Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_docs_network/index_better.html
![Page 47: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/47.jpg)
Tricks in D3 – scales:
![Page 48: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/48.jpg)
Maybe I need One More Tool. Any word relations of interest? Let’s try another hairball…
Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_words_network/index.html
![Page 49: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/49.jpg)
Small “constellations” show shared words (an accident that’s useful!)
Filtered to only the “exciting” nodes…
![Page 50: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/50.jpg)
Another tool: DaVinci Code topics to chapters mapping
“Excitement” rating color scale avg by chapter, ordered (obviously)
Topics (48ish) per chapter (108)
Chapter 1… to Chapter 108
![Page 51: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/51.jpg)
Ah, but since it’s svg/d3…

    var chart = chart.append("g").attr("translate","0," + y).attr("transform","rotate(90 600 600)");
But, maybe I need chapter summaries…. So I can relate them to the topics?
![Page 52: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/52.jpg)
Add some topic-tooltips and fade-outs….
Demo: http://www.ghostweather.com/essays/talks/openvisconf/topic_arc_diagram/TopicArc.html
![Page 53: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/53.jpg)
But what did this show?
Some topics are just neither exciting nor dull – topic clustering (as I did it) had little to do with action scenes. It’s slightly helpful for topics, though :)
These nodes are shaded from gray (dull) to red (exciting)
![Page 54: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/54.jpg)
Coming soon…
Color words in texts by topic assignment, to help tune the stopwords and set up next steps:
• Pre-process text for just the verbs?
• Clean out a class of proper names
• Extract sentences containing the topic words to help describe the topics/texts better
![Page 55: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/55.jpg)
Wrapping up…
§ Python is great for the data munging and analysis
§ Some analysis needs serious vis support
§ Save yourself some work in javascript using Python before you get into js
§ D3 is a great tool for iterative interactive exploration of your analysis results
![Page 56: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/56.jpg)
THANKS! @arnicas
My thanks to…. Luminosity for help with Dan Brown summaries, Jim Vallandingham (@vlandham) for network parameter and coffeescript help.
Hey, I am a consultant for data analysis and visualization. Look me up!
![Page 57: Bestseller Analysis: Visualization Fiction (for PyData Boston 2013)](https://reader034.fdocuments.us/reader034/viewer/2022051819/54c674ce4a795913618b4704/html5/thumbnails/57.jpg)
A Few More References
§ Applied Machine Learning with Scikit-Learn: http://scikit-learn.github.io/scikit-learn-tutorial/index.html
§ Naïve Bayes for text in Scikit-Learn: http://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes
§ Stochastic Gradient Descent in Scikit-Learn: http://scikit-learn.org/0.13/modules/sgd.html
§ Nice tutorial overview of working with text data: scikit-learn.github.io/scikit-learn-tutorial/working_with_text_data.html
§ Bearcart by Rob Story – Rickshaw timeseries graphs from python pandas datastructure in 4 lines (https://github.com/wrobstory/bearcart)
§ LDA topic modeling tool with UI - https://code.google.com/p/topic-modeling-tool/
§ Scott Weingart’s nice overview of LDA Topic Modeling in Digital Humanities: http://www.scottbot.net/HIAL/?p=221
§ Elijah Meeks’ lovely set of articles on LDA & Digital Humanities vis: https://dhs.stanford.edu/comprehending-the-digital-humanities/
§ Jim Vallandingham’s tooltip code and a great demo/tutorial: http://flowingdata.com/2012/08/02/how-to-make-an-interactive-network-visualization/
§ Rickshaw for timeseries graphs: https://github.com/shutterstock/rickshaw