Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Posted on 22-Nov-2014

A three-hour workshop on crowdsourced transcription software, given for the University of Toronto's Roots and Routes seminar in 2012.

Transcript of Roots and Routes: Crowdsourced Manuscript Transcription Workshop

Crowdsourced Manuscript Transcription

Ben Brumfield
Roots and Routes 2012

Not just crowdsourcing...

● Collaborative work
● Off-site solo work
● Private work

Not just manuscripts...

● Maps
● Textiles
● Music
● Flawed OCR

Not just transcription...

● Indexing
● Editing
● Identification

Counting seals on Arctic ice caps.

What it isn't

We'll concentrate on web-based tools for extracting text from images, not addressing:
● Oral History
● Video
● Audio Transcription
● Image Manipulation
● Transcription/Facsimile Display

Tools exist for these tasks, nevertheless.

Break

What materials are you working with outside of modern, printed books and websites?

Origins (Approaches)

Two Approaches and One Dead End
● Indexing
● Editing
● Tagging

Indexing

● Structured Data
● Extracts from Text vs. Representing Text
● Databases for Search and Analysis
● Granular Quality Control
● Gamification
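The "Structured Data" and "Granular Quality Control" points above can be illustrated with a minimal Python sketch. It assumes an indexing project in which several volunteers key the same record independently and each field is accepted only when enough of them agree. The function name `field_consensus`, the `min_agreement` threshold, and the sample menu entries are all hypothetical; this is not how any particular tool implements quality control.

```python
from collections import Counter


def field_consensus(entries, min_agreement=2):
    """Compare independently keyed records field by field.

    Returns (consensus, disputed): a dict of fields whose most common
    value was entered at least `min_agreement` times, and a list of
    fields that still need review.
    """
    consensus, disputed = {}, []
    fields = sorted({name for entry in entries for name in entry})
    for name in fields:
        # Normalize lightly so "Oysters" and "oysters" count as agreement.
        values = [entry[name].strip().lower() for entry in entries if name in entry]
        value, count = Counter(values).most_common(1)[0]
        if count >= min_agreement:
            consensus[name] = value
        else:
            disputed.append(name)
    return consensus, disputed


# Invented sample data: three volunteers keying one line of a menu.
entries = [
    {"dish": "Oysters", "price": "0.25", "section": "Shellfish"},
    {"dish": "oysters", "price": "0.25", "section": "Fish"},
    {"dish": "Oysters", "price": "0.35", "section": "Entrees"},
]
consensus, disputed = field_consensus(entries)
print(consensus)  # fields with enough agreement
print(disputed)   # fields to send back for review
```

Because each field is checked separately, a disagreement over one value does not discard the whole record; only the disputed field goes back for review.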

Editing

● Books, Diaries, Letters, Articles
● Representing Text
● Traditional Editorial Workflow
● Digital or Print Editions

Tagging

● Too small
● Too imprecise

Origins (Traditions)

● OCR Correction
● Documentary Editing
● Genealogy
● Natural Science
● Astronomy

Split this into 5 slides

Online Tools

● Recent (none older than 2005)
● Influenced by origin
● Still pretty raw
● Most require tech expertise for set-up and customization
● All require making trade-offs

Lab Session 1: Breadth

NYPL What's on the Menu

Indexing

Wikisource

Editing

Selection Factors

● Source Material
● Transcript Purpose
● Organizational/Project Management Fit
● Financial and Technical Resources

Source Material

Evaluating your source material:
● Is it of interest to anyone else?
● Is it under copyright?
● Does it need restricted access?
● Is it composed of documents or records?
● Is it non-textual?
● How complex is the layout? How important is that layout?

Purpose

How will you be using the transcribed data?
● Traditional print editions
● Searchable online editions
● Do you want to use the system to analyze the text?
● How do you want to analyze the text?
● Is public engagement a goal?
● Should the transcripts be open?

Organizational/Project Management Fit

● How important is traditional editorial workflow?
● Will you rely on volunteers? How will you motivate them?
● What is the duration of the project?
● Is there a "final version"?
● Is TEI a mandate?

Financial and Technical Resources

Do you have or need:
● System administrators to install non-hosted software?
● Money to pay hosting costs?
● Programming skills to customize a tool?
● Money to pay programmers for customization?
● Support for on-going costs to keep the site running, however small?

Lab Session 2: Markup Options

FromThePage

TranscribeBentham

Technical Questions to Answer

● Where are the images now?
● How do images get into the system?
● How do transcripts get out of the system?
● How mature is the underlying technology?
● How configurable is the technology?
● How does the system work with the public face of your project?
● Where does the metadata live?
● Who will maintain this? How long?
● How many sites are using this system?

Wikisource

Pro:
● MediaWiki plus its add-on modules (e.g. print-on-demand, export).
● Wikimedia community.
● Incredibly mature.
Con:
● Wikimedia policy.
● Public editing.
● Limited mark-up.

Bentham Transcription Desk

Pro:
● MediaWiki is very mature.
● TEI Toolbar (can also be used on other systems).
● Deployed outside original project.
Con:
● Development efforts halted.
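As a rough illustration of what a TEI toolbar produces, here is a minimal Python sketch that wraps a transcriber's deletions and additions in TEI elements. `<p>`, `<del>`, and `<add>` are standard TEI elements, but which elements the Bentham toolbar actually offers is an assumption here, and the helper `tei_paragraph` and the sample sentence are invented for illustration.

```python
import xml.etree.ElementTree as ET


def tei_paragraph(segments):
    """Build a TEI-style <p> from (kind, text) pairs.

    kind is "text" for plain running text, or a TEI element name
    such as "del" (deleted by the writer) or "add" (inserted later).
    """
    p = ET.Element("p")
    last = None  # most recently added child element, if any
    for kind, text in segments:
        if kind == "text":
            if last is None:
                # Text before the first child belongs to the paragraph itself.
                p.text = (p.text or "") + text
            else:
                # Text after a child element is stored as that child's tail.
                last.tail = (last.tail or "") + text
        else:
            last = ET.SubElement(p, kind)
            last.text = text
    return ET.tostring(p, encoding="unicode")


# Invented sample: the writer struck out "cat" and wrote "dog" instead.
segments = [
    ("text", "I saw the "),
    ("del", "cat"),
    ("add", "dog"),
    ("text", " today."),
]
markup = tei_paragraph(segments)
print(markup)  # <p>I saw the <del>cat</del><add>dog</add> today.</p>
```

The point of a toolbar is that volunteers click a button instead of typing these tags by hand; the stored transcript still ends up as markup of this kind.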

Scripto

Pro:
● Team at CHNM has a great track record.
● Your CMS is your public face.
● MediaWiki is very mature.
● Deployed and under active development.
Con:
● Your CMS handles all metadata.
● Mark-up is extremely limited.

FromThePage

Pro:
● Designed for intensive editing and indexing.
● Semantic mark-up and analysis.
● Hosting available.
Con:
● Single developer (me).
● No TEI mark-up.

Islandora TEI Editor

Caveat: I don't know much about this tool or this team.
● Based on Drupal and Fedora.
● Supports TEI via friendly interface.
● Many Drupal-based projects considering it.

T-PEN

Caveat: I don't know much about this tool.
● Designed for medieval manuscripts.
● Supports TEI natively.
● Line-by-line interface.
● Hosted version available.

Scribe

Pro:
● Excellent for complex layout or non-documentary transcription.
● Zooniverse team is large, well-funded, experienced.
● Configurable.
Con:
● No automated tool for loading images or viewing transcript database (yet!)
● No concept of image-as-a-text.

Pybossa

Caveat: I don't know much about this tool or this team.
● Open Knowledge Foundation's crowdsourcing task management tool.
● Designed for tabular data.
● Google Spreadsheet data entry.
● Extremely young.

TextLab

Caveat: I don't know much about this tool or this team.
● Melville Electronic Library.
● Direct addition of TEI tags to image.

Lab Session 3: Configuration

Scribe

Old Weather, What's the Score, Development deployments

Find me

Ben Brumfield
benwbrum@gmail.com

http://manuscripttranscription.blogspot.com/
@benwbrum