Digitizing Serialized Fiction
Kirk Hess
DH 2013 – July 17, 2013
kirkhess@illinois.edu


Serialized Fiction in Farm Newspapers
• Libguide for Serialized Fiction in the Farm, Field and Fireside collection
• “Many of the newspapers in Farm, Field and Fireside published serialized fiction written by renowned authors as well as lesser-known writers and even some long-time readers. This publishing model enabled literature to be disseminated to rural communities and expanded the bounds of American literary culture across geographic and socioeconomic lines.”


Serialized Fiction in the Farmer’s Wife
• The Farmer’s Wife was published from 1897 to 1939; April 1906–April 1939 is digitized in FFF
• “Many of the stories could be characterized as romance fiction designed to appeal to farm wives”
• Previously indexed in a practicum project and stored in a spreadsheet (link); intended as a database with a way to link to existing articles


Newspaper Digitization
• Select newspaper
• Create page images
  • Already microfilmed? If not, film; if the film is bad, fix the film
  • Scan film to TIFF images, cropped and deskewed
• Article segmentation
  • Process TIFFs to Olive specs
  • OCR text; article/ad/image segmentation
• Load into access system (Olive ActivePaper / Veridian)


Finding Serialized Fiction
• Software doesn’t make this easy to find:
  • No metadata
  • OCR problems with newsprint
  • Articles span multiple issues, with no links between them
• On the other hand…
  • The text is there
  • The images are there
  • The articles are segmented


OCR Issues
• Only administrators can correct text
• A lot of errors, not a lot of people
• Manual process, not easily automatable
• Full text not visible
• Users expect correct text
• Demoed many solutions; coalesced around Omeka (http://omeka.org)
• Moving to Veridian in fall 2014


Prototype: Omeka/Scripto
• http://uller.grainger.illinois.edu/omeka/
• Workflow: http://hpnl.pbworks.com/w/page/53056034/Omeka%20instructions
• PM/technical lead (Kirk), 4 part-time editors (Olivia, Matt, Shoshana, Carl)
• Completed project in ~4 months: 736 serials


Completed Story
• “The Mysterious McCorkles” by F. Roney Weir
• http://uller.grainger.uiuc.edu/omeka/items/show/20


TEI?
• Full annotation requires training and a manual process; lite TEI can be automatically generated from corrected text
• Has some advantages for scholars over plain text
• XTF example: http://uller.grainger.uiuc.edu:8080/xtf/search
• More McCorkles: http://uller.grainger.uiuc.edu:8080/xtf/view?docId=tei/TSF00013/TSF00013.xml&chunk.id=AR00300&toc.id=&brand=default
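The lite-TEI step can be sketched as wrapping corrected plain text in a minimal TEI skeleton. This is an illustrative sketch, not the project’s actual conversion code; the minimal element profile (teiHeader/fileDesc/titleStmt plus body paragraphs) is an assumption, and a real conversion would also add the div/pb structure and chunk ids that XTF expects.

```python
import xml.etree.ElementTree as ET

def text_to_lite_tei(title, author, paragraphs):
    """Wrap corrected plain-text paragraphs in a minimal TEI skeleton.

    Illustrative only -- a full conversion would add page breaks,
    divisions, and source metadata.
    """
    tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
    header = ET.SubElement(tei, "teiHeader")
    title_stmt = ET.SubElement(ET.SubElement(header, "fileDesc"), "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    ET.SubElement(title_stmt, "author").text = author
    body = ET.SubElement(ET.SubElement(tei, "text"), "body")
    for para in paragraphs:
        ET.SubElement(body, "p").text = para
    return ET.tostring(tei, encoding="unicode")

xml = text_to_lite_tei("The Mysterious McCorkles", "F. Roney Weir",
                       ["Chapter I.", "It was a dark night on the farm."])
```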


Beyond the Berry Farm
• How can we prioritize work so important text is corrected first?
  • Example: http://uller.grainger.uiuc.edu/omeka/items/show/6
  • Words: 2,876; spelling errors: 55; 98% accuracy
  • Predictive solutions
• How can we identify serialized fiction without having to find it manually and put it in a spreadsheet?
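The accuracy figure above (spelling errors over total words) suggests a simple way to rank items for correction. The sketch below estimates accuracy by dictionary lookup and sorts the least-accurate pages first; the tiny lexicon and the `prioritize` helper are hypothetical stand-ins, not part of the project.

```python
import re

# Tiny stand-in lexicon; a real run would load a full English word list.
LEXICON = {"the", "berry", "farm", "was", "quiet", "under", "snow",
           "and", "chickens", "were", "warm"}

def ocr_accuracy(text, lexicon=LEXICON):
    """Estimate accuracy as the share of alphabetic tokens found in a lexicon."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    if not tokens:
        return 1.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)

def prioritize(pages):
    """Sort (page_id, text) pairs so the least accurate pages come first."""
    return sorted(pages, key=lambda p: ocr_accuracy(p[1]))

pages = [("item6", "The berry farm was qu1et under snow"),
         ("item7", "The chickens were warm and quiet")]
order = [pid for pid, _ in prioritize(pages)]
```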


Identifying Serialized Fiction
• Building a feature set
  • Common n-grams: “Chapter” (number/Roman numeral), “To Be Continued”, “The End”
  • Topic/genre/theme (romance, children’s stories, holidays, etc.)
  • Named entity extraction
  • Predictive solutions (Google API)
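The n-gram markers listed above can be detected with a few regular expressions. This is a sketch, not the project’s feature extractor; the exact patterns would need tuning against real, noisy OCR output.

```python
import re

# Regexes for the serial-fiction markers named on the slide.
MARKERS = {
    "chapter_heading": re.compile(r"\bCHAPTER\s+([IVXLC]+|\d+)\b", re.IGNORECASE),
    "to_be_continued": re.compile(r"\bTO BE CONTINUED\b", re.IGNORECASE),
    "the_end": re.compile(r"^\s*THE END\.?\s*$", re.IGNORECASE | re.MULTILINE),
}

def serial_features(text):
    """Return a dict of boolean features for one article's OCR text."""
    return {name: bool(rx.search(text)) for name, rx in MARKERS.items()}

feats = serial_features("CHAPTER XII.\nBarney drove the wagon home.\n(To be continued.)")
```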


Topics
• Topic analysis (Latent Dirichlet Allocation), David Blei et al.
• A document contains a finite number of topics, and each word can be assigned to a topic
• Used MALLET (http://mallet.cs.umass.edu/)
• Example output:
  Topic 10: Barney time water butter put milk de corn wagon chickens day weather dinner clean Mercy home lay table dry made Marigold morning make Anne bread
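Output like the “Topic 10” line above comes from MALLET’s topic-keys file, which (in a typical run) has one tab-separated line per topic: topic number, Dirichlet weight, then the top words. A minimal parser, assuming that format:

```python
def parse_topic_keys(lines):
    """Parse MALLET --output-topic-keys lines: <topic>\t<weight>\t<top words>.

    The tab-separated layout is assumed from a typical MALLET run.
    """
    topics = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        topic_id, _weight, words = line.split("\t", 2)
        topics[int(topic_id)] = words.split()
    return topics

sample = ["10\t0.042\tBarney time water butter put milk de corn wagon chickens"]
topics = parse_topic_keys(sample)
```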


Network Analysis
• Topics and documents are nodes; document-in-topic relationships are edges
• By generating a network graph (Gephi) we can see connections
• By using clustering algorithms, we can see clusters of documents around a topic
• Train a data-mining algorithm?
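Building that bipartite graph can be sketched as turning per-document topic weights into an edge list in the simple Source/Target/Weight CSV layout Gephi imports. The threshold and function names here are hypothetical, not from the project.

```python
import csv
import io

def doc_topic_edges(doc_topics, threshold=0.1):
    """Turn per-document topic weights into a bipartite edge list.

    doc_topics maps a document id to {topic_id: weight}; only edges at or
    above the threshold are kept, so the graph stays readable.
    """
    edges = []
    for doc, weights in doc_topics.items():
        for topic, w in weights.items():
            if w >= threshold:
                edges.append((doc, f"topic_{topic}", w))
    return edges

def edges_to_csv(edges):
    """Write edges as Source,Target,Weight CSV for import into Gephi."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(edges)
    return buf.getvalue()

edges = doc_topic_edges({"TSF00013": {10: 0.61, 3: 0.02}}, threshold=0.1)
```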


Named Entity Extraction
• Proper names interfere with LSA
• Manually generate a stop word list
  • Lots of names to find!
• Programmatically find names
  • Stanford NLP Named Entity Recognizer
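As a crude, pure-Python stand-in for the Stanford NER pass (which would do much better), one can collect capitalized words that do not begin a sentence as candidate proper names for the stop word list. This heuristic is mine, not the project’s.

```python
import re

def candidate_names(text):
    """Heuristically collect proper-name candidates for a stop word list.

    A capitalized word that does not begin a sentence is often a name in
    fiction; a real NER pass would replace this heuristic.
    """
    names = set()
    for sentence in re.split(r"[.!?]\s+", text):
        tokens = sentence.split()
        for tok in tokens[1:]:          # skip the sentence-initial word
            word = tok.strip(",;:\"'()")
            if word[:1].isupper() and word[1:].islower():
                names.add(word)
    return names

names = candidate_names("Barney drove to town. There Marigold met Anne and the wagon.")
```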


NLTK
• Similar to the movie-review sample: a Naïve Bayes classifier in NLTK over a small subset of articles, using the top 2,000 words
• >>> classifier.show_most_informative_features(5)
  contains(having) = True         fictio : nonfic = 1.9 : 1.0
  contains(plan) = True           fictio : nonfic = 1.9 : 1.0
  contains(growing) = True        fictio : nonfic = 1.9 : 1.0
  contains(entertaining) = True   fictio : nonfic = 1.9 : 1.0
  contains(home) = True           fictio : nonfic = 1.9 : 1.0
• High accuracy (> .95) but weak ratios
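The `contains(word)` features behind that output can be built as in the NLTK book’s movie-review example: pick the most frequent words as the vocabulary, then mark each document by presence or absence. The toy documents and `labeled_docs` in the comment are hypothetical; the training call itself is NLTK’s real `NaiveBayesClassifier.train`.

```python
from collections import Counter

def build_word_features(documents, n=2000):
    """Pick the n most frequent words across all documents as the vocabulary."""
    counts = Counter(w.lower() for doc in documents for w in doc.split())
    return [w for w, _ in counts.most_common(n)]

def document_features(document, word_features):
    """NLTK movie-review-style features: contains(word) -> bool."""
    words = {w.lower() for w in document.split()}
    return {f"contains({w})": (w in words) for w in word_features}

docs = ["Barney drove the wagon home", "butter and milk for dinner",
        "the chickens lay eggs"]
features = document_features(docs[0], build_word_features(docs))
# With NLTK and a labeled corpus, these features feed straight in:
#   classifier = nltk.NaiveBayesClassifier.train(
#       [(document_features(d, vocab), label) for d, label in labeled_docs])
```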


Next Steps
• Implement Veridian
• Crowdsource OCR correction
• Direct access to index (Solr)
• Continue NLP research using the NLTK toolkit