CONTENTdm - OCLC

Geri Ingram Community Manager OCLC CONTENTdm ® Working with Text and PDFs Spring 2015 CONTENTdm User Conference Goucher College Baltimore MD May 27, 2015

Page 1: CONTENTdm - OCLC

Geri Ingram

Community Manager

OCLC CONTENTdm®

Working with Text and PDFs

Spring 2015 CONTENTdm User Conference

Goucher College

Baltimore MD

May 27, 2015

Page 2: CONTENTdm - OCLC

The world’s libraries. Connected.

• To get the most out of this session, you should have either:

• Experience building CONTENTdm collections

OR

• Attended recent CONTENTdm training

Intended audience

Page 3: CONTENTdm - OCLC

• Context-setting

• Data

• File types

• Organization

• Naming

• Collection Configuration

• Adding text-rich digital items

• PDFs, singly and in batch

• Image files: Monographs (singly), Docs, Postcards singly and in batch

• QA

Agenda

Page 4: CONTENTdm - OCLC

Context

Your Users

Your Collections

Discovery & Delivery

Page 5: CONTENTdm - OCLC

• Primary user stories—who are your users?

• What are the text-rich materials you have?

• YOUR users’ needs drive decisions about content and access methods

• YOUR data drive decisions about collection building methods

• Which tools are appropriate for your materials?

• What do different wizards expect by way of file naming and organization?

Context: Your users, Your materials

Page 6: CONTENTdm - OCLC

What are your users looking for?

Yearbooks, newspapers, ETDs…

Page 7: CONTENTdm - OCLC

Historical postcards…

Page 8: CONTENTdm - OCLC

Archival papers…

Page 9: CONTENTdm - OCLC

• Browsing

• Searching: across collections, subgroups?

• Known-item searching, and/or

• Total recall by topic, name, etc.

• Your users expect searchable, text-rich materials

• Full-text searchability across the repository, the site, the world

How are they looking for these materials?

Page 10: CONTENTdm - OCLC

• Metadata is key for discovery

• All fields may be made searchable

• Full-text is quickly becoming necessary for discovery as well as delivery

• Linked data will eventually provide the engine for new knowledge creation

Curated AND indexed

Page 11: CONTENTdm - OCLC

Meet users where they are

• Simple guidelines to help

• Website Config tools

• Primary URL

• Automated site maps

• Moving toward linked data

• Built from standard vocabularies

Page 12: CONTENTdm - OCLC

• Search engines

• Sitemaps, persistent identifiers, etc.

• Schema.org

• Becoming visible in the web-of-things

• Harvesting metadata

• WorldCat

• DPLA

Visibility

Page 13: CONTENTdm - OCLC

Data:

File types

Organization

Naming

Page 14: CONTENTdm - OCLC

• Papers, videos, audio files

• In CONTENTdm, these are natively simple items, not compound objects, e.g.:

• .pdf

• .mp3

• .avi

Formats: Born-digital

Page 15: CONTENTdm - OCLC

• If still to be digitized

• You have control over the project specification

• File name and organization

• Metadata automatically and manually created

• If already digitized

• You choose among the tools for the one that best fits your data organization

Formats: Digitized (reformatted)

Page 16: CONTENTdm - OCLC

• De facto standard for documents

• Can have embedded text

• Portable

• Responsive design must include a PDF reader for market demands

• Preservation metadata formats

• A simple format: PDFs can be ingested through the CONTENTdm Web add as well as through the Project Client.

Adobe PDF

Page 17: CONTENTdm - OCLC

• CONTENTdm-defined classes, used when 2 or more simple items are bound together by logic (and XML):

• Documents: “flat”, a series of related items

• Postcards: exactly two digital files; two-sided items

• Monographs: “hierarchical”, items related in a hierarchy

• Six-sided views: exactly six digital files (known as “picture cube”)

CONTENTdm Compound Objects
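The file-count rules for the four classes can be checked before running a wizard. The following is a minimal sketch in Python, not part of CONTENTdm: the dictionary keys and the function name check_file_count are illustrative labels chosen here, and only the counts (two or more, exactly two, exactly six) come from the slide.

```python
from typing import Optional

# File-count rules as stated on the slide. The keys are illustrative labels
# (an assumption), not CONTENTdm identifiers.
# Each value is (minimum files, maximum files or None for "no upper bound").
FILE_COUNT_RULES = {
    "document": (2, None),
    "monograph": (2, None),
    "postcard": (2, 2),
    "picture_cube": (6, 6),
}


def check_file_count(object_type: str, file_count: int) -> Optional[str]:
    """Return None if the count fits the rule, else a short problem message."""
    if object_type not in FILE_COUNT_RULES:
        return f"unknown object type: {object_type}"
    minimum, maximum = FILE_COUNT_RULES[object_type]
    if file_count < minimum:
        return f"{object_type} needs at least {minimum} files, found {file_count}"
    if maximum is not None and file_count > maximum:
        return f"{object_type} allows exactly {maximum} files, found {file_count}"
    return None
```

A check like this could be run over each scan folder to catch, for example, a postcard folder that holds three files.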

Page 18: CONTENTdm - OCLC

• Remember: metadata fields can be made searchable or not

• In addition, full text extracted from the digital object itself can be stored in a metadata field in any of three ways:

1. Generated by OCR “on-the-fly” (integrated ABBYY FineReader®)

2. Imported as .txt transcript

• Transcribed (typed) from handwritten manuscripts

or

• OCR’d in advance (external OCR engine)

3. Extracted (by server) from PDFs (if text has already been created from the page images)

Providing searchable text from image files

Page 19: CONTENTdm - OCLC

Configuration:

Collection, Project, & Website

Page 20: CONTENTdm - OCLC

1. Examine the folders and files

2. Access the collection administration page

3. Configure PDF conversion option

4. Add appropriate metadata fields for your data:

• A searchable field with a Full text search data type

• A Tag field that is searchable and hidden

5. Use metadata templates as much as possible:

• Page level

• Compound object level

• PDFs

Configure Collection and Project

Page 21: CONTENTdm - OCLC

• If your materials have searchable text, you will need:

• Collection field: one empty, searchable field configured as “Full text search” data type to hold text

• For “top” level records only

• Website Config tool: to suppress display of components of compound objects in search results

• Export via OAI-PMH

Collection and Website Configuration choices
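For the OAI-PMH export, a harvester asks for records with a ListRecords request. The sketch below only builds such a request URL using the standard OAI-PMH query parameters (verb, metadataPrefix, set). The base URL and the set name in the usage note are placeholders, not actual CONTENTdm endpoints, and the function name is made up for illustration.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url: str, set_spec: str,
                           metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for one set (collection)."""
    params = {
        "verb": "ListRecords",              # standard OAI-PMH verb
        "metadataPrefix": metadata_prefix,  # oai_dc is the baseline format
        "set": set_spec,                    # restricts the harvest to one set
    }
    return f"{base_url}?{urlencode(params)}"
```

For example, build_list_records_url("https://example.org/oai", "yearbooks") produces a URL that could be pasted into a browser to see what a harvester would receive, assuming the server exposes OAI-PMH at that (hypothetical) address.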

Page 22: CONTENTdm - OCLC

Website Config options for PDF display

Page 23: CONTENTdm - OCLC

Adding text-rich materials

PDFs

Images

Page 24: CONTENTdm - OCLC

• Collection configuration option: “Convert PDF to compound object”

• What it does and does NOT do

• How/when you might override it

• Effect on the end users’ view:

• A multi-page PDF will call the compound object viewer if it has been processed as if it were a compound object

• A one-page PDF will ignore the setting and call the item viewer to display

Processing a .pdf to optimize indexing, search and display

Remember: PDFs are simple files that can be converted to compound objects, but they are still counted AND added as simple items

Page 25: CONTENTdm - OCLC

• It DOES allow very large PDF files to be indexed, searched and retrieved quickly: EACH page can have up to 128,000 characters.

• It DOES allow end users to search for text across huge volumes of materials.

• It DOES allow the end-user to choose to view the PDF by thumbnail, by contents, with View PDF and text*, or through Page flip.

• It does NOT allow you to “nest” compound objects. I.e., you can assemble multiple PDFs as a compound object, but you cannot then take advantage of the page-level indexing, display etc., within each “page” of the compound object.

What PDF conversion does and does NOT do.
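The 128,000-character figure can be used as a quick pre-flight check on page text. This is a minimal sketch that assumes the figure is a per-page upper bound and that the text of each page has already been extracted by some other tool. The function name pages_over_limit is illustrative, not a CONTENTdm feature.

```python
from typing import List, Tuple

# Per-page character figure quoted on the slide; treating it as a hard
# upper bound is an assumption of this sketch.
PAGE_CHAR_LIMIT = 128_000


def pages_over_limit(page_texts: List[str],
                     limit: int = PAGE_CHAR_LIMIT) -> List[Tuple[int, int]]:
    """Return (1-based page number, character count) for pages above the limit."""
    return [
        (page_number, len(text))
        for page_number, text in enumerate(page_texts, start=1)
        if len(text) > limit
    ]
```

An empty result means every page fits. Any pages reported are candidates for a closer look before import.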

Page 26: CONTENTdm - OCLC

PDF importing

Explain and Demonstrate:

Page 27: CONTENTdm - OCLC

• All from one folder

• Several from same folder

Add items, multiple items

Page 28: CONTENTdm - OCLC

Images: Compound Objects

Explain and Demonstrate:

Page 29: CONTENTdm - OCLC

• Definitions:

• Compound Object: a series of 2 or more items assembled together

• Wizard: “Add Compound Object”

• Single or Multiple (either Object List or Directory Structure)

• Type of Object (Document, Monograph, Post Card, Picture Cube)

• Method leverages data in hand (with or without tab-delimited metadata)

• Materials commonly assembled as compound objects, e.g.,

• Yearbooks, Papers, Postcards, Books

File organization and naming

Preparing to use the Project Client wizards

Page 30: CONTENTdm - OCLC

• For all compound object types

• Document, Monograph, Postcard, Picture cube

• For each compound object

• All digital files must reside in one directory/folder

• This is true whether you are adding multiple compound objects or a single compound object.

• And with multiple objects, all must be of the same type

File and folder facts: regardless of the wizard to be used in the Add compound object function, SCANS are held together in one folder

Page 31: CONTENTdm - OCLC

Example: a single Document using Add compound object

Structure by folder organization

Page 32: CONTENTdm - OCLC

Example: a single Monograph using Add compound object

Where structured by folder organization

Where structured by a tab-delimited text file

Page 33: CONTENTdm - OCLC

When you add multiple compound objects using tab-delimited files, the nature and placement of those files change

Got page-level metadata? Each object needs its own tab-delimited file.

Got only object-level metadata? All objects share one tab-delimited file.

Page 34: CONTENTdm - OCLC

For compound objects using tab-delimited files, the structure of the .txt file itself changes depending upon the Object class

Document: all “columns” are field attributes

Monograph: two new “columns” define structure
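To make the Document case concrete, here is a minimal sketch of reading a tab-delimited file in which every column is a metadata field. The column names (Title, Date, File name) and the sample rows are hypothetical, chosen only for illustration. The two extra structure columns used for Monographs are not named on the slide, so they are not shown.

```python
import csv
import io
from typing import Dict, List


def read_tab_delimited(text: str) -> List[Dict[str, str]]:
    """Parse tab-delimited text: first row is field names, each later row is one page."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical sample: the field names and values are assumptions.
sample = (
    "Title\tDate\tFile name\n"
    "Letter, page 1\t1943\tletter_001.tif\n"
    "Letter, page 2\t1943\tletter_002.tif\n"
)
records = read_tab_delimited(sample)
```

Reading the file back this way before import is a simple way to spot a missing tab or a row with the wrong number of columns.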

Page 35: CONTENTdm - OCLC

• Monographs: two methods for structure and transcripts

• Will you OCR to get transcript?

• Method: directory (folder) structure:

• Yearbook (DEMO only OCR’d transcripts produced on the fly)

• Do you have transcripts in .txt file for each image?

• Method: tab-d file structure

• Book (Separate transcript externally produced)

Demonstrate: (single) Monograph Compound object

Wizard: Compound Object
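The question "Do you have transcripts in .txt file for each image?" can be answered with a quick check of file names. This sketch assumes a transcript shares its image's base name (for example p001.tif and p001.txt). That pairing convention, the list of image extensions, and the function name are assumptions made for illustration, so confirm the naming the wizard expects in the help files.

```python
from pathlib import Path
from typing import Iterable, List

# Illustrative set of image extensions (an assumption, not a CONTENTdm list).
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".jp2", ".png"}


def images_missing_transcripts(file_names: Iterable[str]) -> List[str]:
    """Return image file names that have no same-named .txt transcript."""
    names = list(file_names)
    # Base names (without extension) of every .txt file in the list.
    transcript_stems = {
        Path(name).stem for name in names if Path(name).suffix.lower() == ".txt"
    }
    return sorted(
        name
        for name in names
        if Path(name).suffix.lower() in IMAGE_EXTENSIONS
        and Path(name).stem not in transcript_stems
    )
```

Passing in the file names from one scan folder returns the images that still need a transcript, or an empty list when every image is paired.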

Page 36: CONTENTdm - OCLC

• Documents:

• Using Object List with a .txt file for each scan directory

• Typescript letters with page-level metadata IN .txt files (Horowitz archives)

• Postcards:

• Using Directory structure

• Hand-written: some with typescripts, some without

Demonstrate: Compound Objects

Two wizards, each leverages the data

Page 37: CONTENTdm - OCLC

• Administration/Edit: one at a time (structure, object-level metadata)

• Project Client/Find in Collection (search or browse)

• Batch or single; all structure and metadata can be edited

• Edit in one or more ways, depending upon the data

• Move, replace, delete, add pages (only ‘true’ compound objects)

• Edit a transcript; find and replace across pages

• Save, upload, re-approve and re-index

Demonstrate: Edit Compound Objects

Page 38: CONTENTdm - OCLC

• Getting help with compound objects

• User Support Center

• Tutorials to study

• Installing, activating the OCR extension

• Help files related to text works

• Office hours twice monthly

• Write [email protected]

Questions & Answers

Page 39: CONTENTdm - OCLC

• Geri Ingram

[email protected]

Questions?