Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Crowdsourced Manuscript Transcription Ben Brumfield Roots and Routes 2012

description

A three-hour workshop on crowdsourced transcription software, given for the University of Toronto's Roots and Routes seminar in 2012.

Transcript of Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Page 1: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Crowdsourced Manuscript Transcription

Ben Brumfield
Roots and Routes 2012

Page 2: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Not just crowdsourcing...

● Collaborative work
● Off-site solo work
● Private work

Page 3: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Not just manuscripts...

● Maps
● Textiles
● Music
● Flawed OCR

Page 4: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Not just transcription...

● Indexing
● Editing
● Identification

Counting seals on Arctic ice caps.

Page 5: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

What it isn't

We'll concentrate on web-based tools for extracting text from images, not addressing:
● Oral History
● Video
● Audio Transcription
● Image Manipulation
● Transcription/Facsimile Display

Tools exist for these tasks, nevertheless.

Page 6: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Break

What materials are you working with outside of modern, printed books and websites?

Page 7: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Origins (Approaches)

Two Approaches and One Dead End
● Indexing
● Editing
● Tagging

Page 8: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Indexing

● Structured Data
● Extracts from Text vs. Representing Text
● Databases for Search and Analysis
● Granular Quality Control
● Gamification

Page 9: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Editing

● Books, Diaries, Letters, Articles
● Representing Text
● Traditional Editorial Workflow
● Digital or Print Editions

Page 10: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Tagging

● Too small
● Too imprecise

Page 11: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Origins (Traditions)

● OCR Correction
● Documentary Editing
● Genealogy
● Natural Science
● Astronomy

Split this into 5 slides

Page 12: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Online Tools

● Recent (none older than 2005)
● Influenced by origin
● Still pretty raw
● Most require tech expertise for set-up and customization
● All require making trade-offs

Page 13: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Lab Session 1: Breadth

NYPL What's on the Menu: Indexing

Wikisource: Editing

Page 14: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Selection Factors

● Source Material
● Transcript Purpose
● Organizational/Project Management Fit
● Financial and Technical Resources

Page 15: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Source Material

Evaluating your source material:
● Is it of interest to anyone else?
● Is it under copyright?
● Does it need restricted access?
● Is it composed of documents or records?
● Is it non-textual?
● How complex is the layout? How important is that layout?

Page 16: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Purpose

How will you be using the transcribed data?
● Traditional print editions
● Searchable online editions
● Do you want to use the system to analyze the text?
● How do you want to analyze the text?
● Is public engagement a goal?
● Should the transcripts be open?

Page 17: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Organizational/Project Management Fit

● How important is traditional editorial workflow?
● Will you rely on volunteers? How will you motivate them?
● What is the duration of the project?
● Is there a "final version"?
● Is TEI a mandate?

Page 18: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Financial and Technical Resources

Do you have or need:
● System administrators to install non-hosted software?
● Money to pay hosting costs?
● Programming skills to customize a tool?
● Money to pay programmers for customization?
● Support for on-going costs to keep the site running, however small?

Page 19: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Lab Session 2: Markup Options

FromThePage

TranscribeBentham

Page 20: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Technical Questions to Answer

● Where are the images now?
● How do images get into the system?
● How do transcripts get out of the system?
● How mature is the underlying technology?
● How configurable is the technology?
● How does the system work with the public face of your project?
● Where does the metadata live?
● Who will maintain this? How long?
● How many sites are using this system?

Page 21: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Wikisource

Pro:
● MediaWiki plus its add-on modules (e.g. print-on-demand, export).
● Wikimedia community.
● Incredibly mature.

Con:
● Wikimedia policy.
● Public editing.
● Limited mark-up.

Page 22: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Bentham Transcription Desk

Pro:
● MediaWiki is very mature.
● TEI Toolbar (can also be used on other systems).
● Deployed outside original project.

Con:
● Development efforts halted.

Page 23: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Scripto

Pro:
● Team at CHNM has a great track record.
● Your CMS is your public face.
● MediaWiki is very mature.
● Deployed and under active development.

Con:
● Your CMS handles all metadata.
● Mark-up is extremely limited.

Page 24: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

FromThePage

Pro:
● Designed for intensive editing and indexing.
● Semantic mark-up and analysis.
● Hosting available.

Con:
● Single developer (me).
● No TEI mark-up.

Page 25: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Islandora TEI Editor

Caveat: I don't know much about this tool or this team.
● Based on Drupal and Fedora.
● Supports TEI via friendly interface.
● Many Drupal-based projects considering it.

Page 26: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

T-PEN

Caveat: I don't know much about this tool.
● Designed for medieval manuscripts.
● Supports TEI natively.
● Line-by-line interface.
● Hosted version available.

Page 27: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Scribe

Pro:
● Excellent for complex layout or non-documentary transcription.
● Zooniverse team is large, well-funded, experienced.
● Configurable.

Con:
● No automated tool for loading images or viewing transcript database (yet!)
● No concept of image-as-a-text.

Page 28: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Pybossa

Caveat: I don't know much about this tool or this team.
● Open Knowledge Foundation's crowdsourcing task management tool.
● Designed for tabular data.
● Google Spreadsheet data entry.
● Extremely young.

Page 29: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

TextLab

Caveat: I don't know much about this tool or this team.
● Melville Electronic Library.
● Direct addition of TEI tags to image.

Page 30: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Lab Session 3: Configuration

Scribe

Old Weather, What's the Score, Development deployments

Page 31: Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Find me

Ben Brumfield

[email protected]

http://manuscripttranscription.blogspot.com/

@benwbrum