Temple University Digital Scholarship Center: Model of the Month Club: September 2015
Uploaded by liz-rodrigues
Model of the Month Club Meeting 1:
What is a model in DH? Example: Underwood et al., Understanding Genre
Essentially, all models are wrong, but some are useful.
--George Box, statistician
1919-2013
What’s a model? (broadest definition)
A model is a simplified representation of something, and in principle models can be built out of words, balsa wood, or anything you like. In practice... statistical models are often equations that describe the probability of an association between variables.
Ted Underwood, Seven ways humanists are using computers to understand text. http://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/
What kinds of models are we looking at in this workshop?
Ted Underwood, Seven ways humanists are using computers to understand text. http://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/
What kinds of model are we looking at today?
Ted Underwood, Seven ways humanists are using computers to understand text. http://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/
Understanding Genre in a Collection of a Million Volumes (Underwood et al.)
Problem: Classification
Why is this a problem?
1) HathiTrust has poor genre metadata.
2) Volumes are generically heterogeneous.
Desired result: provide a way of sorting HathiTrust text data to make it useful for literary scholars
Classification as a form of machine learning
In general:
Data → training data → predictive classifier (model) → prediction → evaluation
This project:
Text → coded text → regularized logistic regression → prediction → 93.9% accurate
+hidden Markov smoothing
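The evaluation step at the end of this pipeline can be sketched as a simple accuracy calculation: the share of pages where the predicted genre matches the human label. This is only an illustration; the function name and the toy labels are made up, and the report's own evaluation procedure is not reproduced here.

```python
def accuracy(predicted, actual):
    """Fraction of pages whose predicted genre equals the human-assigned genre."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Toy labels for four pages (invented for illustration).
predicted = ["fiction", "fiction", "poetry", "nonfiction"]
actual = ["fiction", "poetry", "poetry", "nonfiction"]
print(accuracy(predicted, actual))  # 0.75
```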
Data: Text
We began by obtaining full text of all public-domain English-language works in HathiTrust between 1700 and 1922. Organizing a group of five readers, we asked them to label individual pages in a total of 414 books; this produced our training data. We transformed the text of all the books into counts of features on each page; most of these features were words that we counted, but we also recorded other details of page structure.
Underwood: Understanding Genre Interim Report
Background: Bag of words
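A bag of words treats each page as a set of word counts and ignores word order. A minimal sketch, assuming a simple regular-expression tokenizer (the project's actual tokenization and its structural features are not shown here; `page_features` is a hypothetical name):

```python
import re
from collections import Counter

def page_features(page_text):
    """Count lowercase word tokens on one page, ignoring word order."""
    # Assumption: a token is a run of letters or digits, optionally followed
    # by an apostrophe and more letters (e.g. "ship's").
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", page_text.lower())
    return Counter(tokens)

# Two invented pages: one that reads like fiction, one like a bibliography.
pages = [
    "The whale, the whale! cried Ahab.",
    "Ibid., p. 42. See also ibid., p. 57.",
]
counts = [page_features(p) for p in pages]
print(counts[0]["the"], counts[1]["ibid"])  # 2 2
```

Each page becomes a table of counts, and those counts are what the classifier sees.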
Training data: Coded text
223 volumes were tagged by five people, with assigned volume lists overlapping so that almost all the pages in the volumes were read by at least two readers (and some by three). This strategy allowed us to make tentative estimates of human dissensus, which were invaluable. But it was a relatively slow process, because it required coordination. The remaining 191 volumes were simply tagged at the page level by the PI. In cases where we had three readers, we resolved human disagreements by voting. In other cases, we accepted the more general genre tag, or the tag produced by more experienced readers.
But: Selection of volumes was probably the most questionable aspect of our methodology, and an area we will give more attention as we expand into the twentieth century.
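The voting rule for pages with three readers can be sketched as a majority vote. This is an illustration only: `resolve_labels` is a hypothetical name, and the sketch simply returns `None` when there is no majority, which is where the report's other rules (the more general tag, or the more experienced reader) would apply.

```python
from collections import Counter

def resolve_labels(labels):
    """Return the label with a strict majority among readers, else None."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    # A strict majority means more than half of the readers agree.
    return label if votes > len(labels) / 2 else None

print(resolve_labels(["fiction", "fiction", "nonfiction"]))  # fiction
print(resolve_labels(["fiction", "nonfiction"]))             # None
```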
Classification: Feature engineering
We used 1062 features in our models. 1036 of them were words, or word categories; a full list is available on Github: https://github.com/tedunderwood/genre/blob/master/data/biggestvocabulary.txt. In general, we selected features by grouping pages into the categories we planned to classify. We took the top 500 words from each category, and then grouped the words from all categories into a master list that we could limit to the top N most frequent words. This ensured that our list contained words like “8vo” and “ibid” that might be uncommon in the whole corpus, but extremely dispositive as clues about a particular class of pages. We normalized everything to lowercase (after counting certain forms of capitalization as “structural features”) and truncated final apostrophe-s.
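The selection procedure described above (top words per category, merged into one master list, then cut to the top N) might look roughly like this. It is a sketch under assumptions: the function names, the toy data, and the tie-breaking by alphabetical order are all invented for illustration, and the structural features are left out.

```python
from collections import Counter

def normalize(token):
    """Lowercase a token and truncate a final apostrophe-s."""
    token = token.lower()
    return token[:-2] if token.endswith("'s") else token

def build_vocabulary(tokens_by_category, per_category=500, master_size=1000):
    """Take the top words in each category, merge them, keep the most frequent."""
    overall = Counter()
    candidates = set()
    for tokens in tokens_by_category.values():
        counts = Counter(normalize(t) for t in tokens)
        overall.update(counts)
        candidates.update(word for word, _ in counts.most_common(per_category))
    # Rank merged candidates by corpus-wide frequency (ties broken alphabetically).
    ranked = sorted(candidates, key=lambda w: (-overall[w], w))
    return ranked[:master_size]

# Invented toy corpus: all tokens from pages in each category.
toy = {
    "fiction": "She said the ship's captain said nothing".split(),
    "bibliography": "ibid 8vo ibid London 8vo ibid".split(),
}
vocab = build_vocabulary(toy, per_category=2, master_size=4)
print(vocab)
```

Because words are chosen per category first, "ibid" and "8vo" make the list even though they never appear on the fiction pages.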
Classification: Regularized Logistic Regression
Once we had designed this overall workflow, it was possible to plug different classification algorithms into the page-level classification step of the process. We tried a range of algorithms here, including random forests and support vector machines. We also tried a range of different ensemble strategies, including strategies that combine multiple algorithms, before settling on an ensemble of regularized logistic models, trained by comparing each genre to all the other genres collectively.
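"Comparing each genre to all the other genres collectively" is usually called one-vs-rest training: each genre gets its own yes/no model. A small sketch of the idea follows. The function names and the probabilities are hypothetical, and picking the genre with the highest probability is an assumption; the way the project combines its models is not described on this slide.

```python
def one_vs_rest_targets(labels, genre):
    """Binary targets for one genre: 1 for that genre, 0 for all others."""
    return [1 if label == genre else 0 for label in labels]

def pick_genre(probabilities):
    """Choose the genre whose binary model reports the highest probability."""
    return max(probabilities, key=probabilities.get)

labels = ["fiction", "poetry", "nonfiction", "fiction"]
print(one_vs_rest_targets(labels, "fiction"))  # [1, 0, 0, 1]

# Hypothetical outputs of three separately trained binary models for one page.
page_scores = {"fiction": 0.72, "poetry": 0.10, "nonfiction": 0.35}
print(pick_genre(page_scores))  # fiction
```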
Regularized Logistic Regression
name for a kind of classification algorithm: a set of assumptions & mathematical processes designed to predict the likelihood that a given set of features occurring on a single page means that page belongs to a specific genre
in general:
compares the odds that an instance belongs to a class when a particular feature (or set of features) is present with the odds that it belongs to that class without those features
does not need or assume a linear relationship between the features and the outcome itself (the relationship is assumed linear only in the log-odds)
does not assume the features follow a particular distribution (e.g. normal)
creates a decision boundary used to produce binary outcomes (yes or no, fiction or not fiction)
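To make the points above concrete, here is a minimal sketch of a binary regularized logistic regression in plain Python. It is not the project's code: the feature names, the toy data, the L2 penalty strength, and the training settings are all assumptions chosen for illustration. "Regularized" here means a penalty that shrinks the weights toward zero so the model does not overfit the training pages.

```python
import math

def sigmoid(z):
    """Map a log-odds score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability that a page with features x belongs to the class."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train(X, y, l2=0.1, lr=0.5, epochs=500):
    """Fit weights and bias by gradient descent on L2-penalized log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        # The l2 * wj term is the regularization: it pulls weights toward zero.
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy pages: relative frequency of two made-up features, ["said", "ibid"].
X = [[0.9, 0.0], [0.8, 0.1], [0.1, 0.9], [0.0, 0.8]]
y = [1, 1, 0, 0]  # 1 = fiction, 0 = not fiction
w, b = train(X, y)

p = predict_proba(w, b, [0.7, 0.1])
# Decision boundary: probability 0.5, i.e. log-odds of zero.
print("fiction" if p >= 0.5 else "not fiction")
```

The model is linear in the log-odds (the weighted sum inside `sigmoid`), and the 0.5 cutoff turns the probability into a yes-or-no answer.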
Example
http://courses.washington.edu/css490/2012.Winter/lecture_slides/05b_logistic_regression.pdf
+a Hidden Markov Model
assumes the probability of each item in a sequence is affected by the item immediately before it, & that what drives this is a hidden state (something not directly observed that influences the observed instance)
used in this case to try to incorporate the fact that the genre of a page has something to do with the genres of the pages around it in the volume
From project:
There are a variety of clever approaches that might be tried to coordinate page-level predictions with knowledge of volume structure. We trained a hidden Markov model, which is a relatively simple approach. The model contains information about the probability of transition from one genre to another, so it is in a sense a model of volume structure. But in practice, its main effect is to smooth out noisy single-page errors: for instance, it was good at catching a few isolated pages misclassified as nonfiction in the middle of a novel.
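The smoothing effect described above can be illustrated with the Viterbi algorithm, a standard way to find the most likely sequence of hidden states in a hidden Markov model. This is a sketch under assumptions, not the project's implementation: the `stay` probability is invented rather than learned from data, the page probabilities are made up and must be greater than zero, and using classifier probabilities directly as emission scores is a simplification.

```python
import math

def viterbi_smooth(page_probs, genres, stay=0.9):
    """Return the most likely genre sequence given per-page probabilities.

    page_probs: list of dicts mapping genre -> classifier probability (> 0).
    stay: assumed probability that the next page keeps the same genre.
    """
    switch = (1.0 - stay) / (len(genres) - 1)

    def trans(a, b):
        return math.log(stay if a == b else switch)

    # best[g] = log-probability of the best path so far that ends in genre g.
    best = {g: math.log(page_probs[0][g]) for g in genres}
    back = []
    for probs in page_probs[1:]:
        pointers, new_best = {}, {}
        for g in genres:
            prev = max(genres, key=lambda h: best[h] + trans(h, g))
            pointers[g] = prev
            new_best[g] = best[prev] + trans(prev, g) + math.log(probs[g])
        back.append(pointers)
        best = new_best

    # Trace back from the best final genre to recover the whole path.
    path = [max(genres, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

genres = ["fiction", "nonfiction"]
noisy = [
    {"fiction": 0.9, "nonfiction": 0.1},
    {"fiction": 0.8, "nonfiction": 0.2},
    {"fiction": 0.4, "nonfiction": 0.6},  # isolated page leaning nonfiction
    {"fiction": 0.9, "nonfiction": 0.1},
    {"fiction": 0.85, "nonfiction": 0.15},
]
print(viterbi_smooth(noisy, genres))
```

On its own, the third page would be labeled nonfiction; because switching genres is assumed to be unlikely, the smoothed sequence keeps it as fiction along with its neighbors.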
Evaluation
Result
https://sharc.hathitrust.org/genre
Discussion