Crowdsourced Manuscript Transcription Ben Brumfield Roots and Routes 2012.

Crowdsourced Manuscript Transcription

Ben Brumfield

Roots and Routes 2012

Not just crowdsourcing...

• Collaborative work

• Off-site solo work

• Private work

Not just manuscripts...

• Maps

• Textiles

• Music

• Flawed OCR

Not just transcription...

• Indexing

• Editing

• Identification

Counting seals on Arctic ice caps.

What it isn't

We'll concentrate on web-based tools for extracting text from images, not addressing:

• Oral History

• Video

• Audio Transcription

• Image Manipulation

• Transcription/Facsimile Display

Tools do exist for these tasks, however.

Break

What materials are you working with outside of modern, printed books and websites?

Origins (Approaches)

Two Approaches and one Dead End

• Indexing

• Editing

• Tagging

Indexing

• Structured Data

• Extracts from Text vs. Representing Text

• Databases for Search and Analysis

• Granular Quality Control

• Gamification

Editing

• Books, Diaries, Letters, Articles

• Representing Text

• Traditional Editorial Workflow

• Digital or Print Editions
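
The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration, not how any of the tools discussed here is implemented: the `MenuEntry` and `PageTranscript` classes, the `field_consensus` helper, and the sample menu data are all hypothetical. Indexing produces structured records whose individual fields can be checked one at a time (shown here, as an assumption, by majority vote across volunteers), while editing keeps the text itself as the thing being represented.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class MenuEntry:
    """Indexing: one structured record extracted from a page image."""
    page: int
    dish: str
    price_cents: int


@dataclass
class PageTranscript:
    """Editing: the text of the page itself, kept as running text."""
    page: int
    text: str


def field_consensus(entries, field):
    """Return the value most volunteers entered for one field.

    Returns None when no value has a strict majority, so the field
    can be flagged for review (an assumed quality-control policy).
    """
    values = [getattr(entry, field) for entry in entries]
    if not values:
        return None
    value, votes = Counter(values).most_common(1)[0]
    return value if votes * 2 > len(values) else None


# Three hypothetical volunteer submissions for the same menu line.
submissions = [
    MenuEntry(page=1, dish="Oyster stew", price_cents=35),
    MenuEntry(page=1, dish="Oyster stew", price_cents=35),
    MenuEntry(page=1, dish="Oyster slew", price_cents=35),
]

agreed_dish = field_consensus(submissions, "dish")
agreed_price = field_consensus(submissions, "price_cents")

# The editing approach stores the page as text rather than as fields.
transcript = PageTranscript(page=1, text="Oyster stew .... 35\nClam chowder .... 25")
```

Checking each field separately is what makes the quality control granular: a disagreement about the dish name does not throw away an agreed price.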

Tagging

• Too small

• Too imprecise

Origins (Traditions)

• OCR Correction

• Documentary Editing

• Genealogy

• Natural Science

• Astronomy

Split this into 5 slides

Online Tools

• Recent (none older than 2005)

• Influenced by origin

• Still pretty raw

• Most require tech expertise for set-up and customization

• All require making trade-offs

Lab Session 1: Breadth

NYPL What's on the Menu

Indexing

Wikisource

Editing

Selection Factors

• Source Material

• Transcript Purpose

• Organizational/Project Management Fit

• Financial and Technical Resources

Source Material

Evaluating your source material:

• Is it of interest to anyone else?

• Is it under copyright?

• Does it need restricted access?

• Is it composed of documents or records?

• Is it non-textual?

• How complex is the layout? How important is that layout?

Purpose

How will you be using the transcribed data?

• Traditional print editions

• Searchable online editions

• Do you want to use the system to analyze the text?

• How do you want to analyze the text?

• Is public engagement a goal?

• Should the transcripts be open?

Organizational/Project Management Fit

• How important is traditional editorial workflow?

• Will you rely on volunteers? How will you motivate them?

• What is the duration of the project?

• Is there a "final version"?

• Is TEI a mandate?

Financial and Technical Resources

Do you have or need:

• System administrators to install non-hosted software?

• Money to pay hosting costs?

• Programming skills to customize a tool?

• Money to pay programmers for customization?

• Support for on-going costs to keep the site running, however small?

Lab Session 2: Markup Options

FromThePage

TranscribeBentham

Technical Questions to Answer

• Where are the images now?

• How do images get into the system?

• How do transcripts get out of the system?

• How mature is the underlying technology?

• How configurable is the technology?

• How does the system work with the public face of your project?

• Where does the metadata live?

• Who will maintain this? How long?

• How many sites are using this system?

Wikisource

Pro:

• MediaWiki plus its add-on modules (e.g. print-on-demand, export).

• Wikimedia community.

• Incredibly mature.

Con:

• Wikimedia policy.

• Public editing.

• Limited mark-up.

Bentham Transcription Desk

Pro:

• MediaWiki is very mature.

• TEI Toolbar (can also be used on other systems)

• Deployed outside original project.

Con:

• Development efforts halted.

Scripto

Pro:

• Team at CHNM has a great track record.

• Your CMS is your public face.

• MediaWiki is very mature.

• Deployed and under active development.

Con:

• Your CMS handles all metadata.

• Mark-up is extremely limited.

FromThePage

Pro:

• Designed for intensive editing and indexing.

• Semantic mark-up and analysis.

• Hosting available.

Con:

• Single developer (me).

• No TEI mark-up.

Islandora TEI Editor

Caveat: I don't know much about this tool or this team.

• Based on Drupal and Fedora

• Supports TEI via friendly interface

• Many Drupal-based projects considering it.

T-PEN

Caveat: I don't know much about this tool.

• Designed for medieval manuscripts.

• Supports TEI natively.

• Line-by-line interface.

• Hosted version available.

Scribe

Pro:

• Excellent for complex layout or non-documentary transcription.

• Zooniverse team is large, well-funded, experienced.

• Configurable.

Con:

• No automated tool for loading images or viewing transcript database (yet!)

• No concept of image-as-a-text.

Pybossa

Caveat: I don't know much about this tool or this team.

• Open Knowledge Foundation's crowdsourcing task management tool.

• Designed for tabular data.

• Google Spreadsheet data entry.

• Extremely young.

TextLab

Caveat: I don't know much about this tool or this team.

• Melville Electronic Library.

• Direct addition of TEI tags to image.

Lab Session 3: Configuration

Scribe

Old Weather

What's the Score

Development deployments

Find me

Ben Brumfield

[email protected]

http://manuscripttranscription.blogspot.com/

@benwbrum