CONTENTdm - OCLC

Geri Ingram

Community Manager

OCLC CONTENTdm® Working with Text and PDFs

Spring 2015 CONTENTdm User Conference

Goucher College

Baltimore MD

May 27, 2015

The world’s libraries. Connected.

• To get the most out of this session

• Either you have:

• Experience building CONTENTdm collections

OR

• Attended recent CONTENTdm Training

Intended audience

• Context-setting

• Data

• Filetypes

• Organization

• Naming

• Collection Configuration

• Adding text-rich digital items

• PDFs, singly and in batch

• Image files: Monographs (singly), Docs, Postcards singly and in batch

• QA

Agenda

Context: Your Users, Your Collections, Discovery & Delivery

• Primary user stories—who are your users?

• What are the text-rich materials you have?

• YOUR users' needs drive decisions about content and access methods

• YOUR data drive decisions about collection building methods

• Which tools are appropriate for your materials?

• What do different wizards expect by way of file naming and organization?

Context: Your users, Your materials

What are your users looking for? Yearbooks, newspapers, ETDs…

Historical postcards…

Archival papers…

• Browsing

• Searching--across collections, subgroups?

• Known item searching, and/or

• Total recall by topic, name, etc.

• Your users expect searchable, text-rich materials

• Full-text searchability across the repository, the site, the world.

How are they looking for these materials?

• Metadata is key for discovery

• All fields may be made searchable

• Full-text is quickly becoming necessary for discovery as well as delivery

• Linked data will eventually provide the engine for new knowledge creation

Curated AND indexed

Meet users where they are

• Simple guidelines to help

• Website Config tools

• Primary URL

• Automated site maps

• Moving toward linked data

• Built from standard vocabularies

• Search engines

• Sitemaps, persistent identifiers, etc.

• Schema.org

• Becoming visible in the web-of-things

• Harvesting metadata

• WorldCat

• DPLA

Visibility

Data: File types, Organization, Naming

• Papers, videos, audio files

• In CONTENTdm, these are natively simple items, not compound objects, e.g.:

• .pdf

• .mp3

• .avi

Formats: Born-digital

• If still to be digitized

• You have control over the project specification

• File name and organization

• Metadata automatically and manually created

• If already digitized

• You choose among the tools for the one that best fits your data organization

Formats: Digitized (reformatted)

• De facto standard for documents

• Can have embedded text

• Portable

• Responsive design must include a PDF reader for market demands

• Preservation metadata formats

• A simple format: can be ingested through CONTENTdm Web add as well as through Project Client.

Adobe PDF

• CONTENTdm-defined classes, used when 2 or more simple items are bound together by logic (and XML):

• Documents: “flat”, a series of related items

• Postcards: exactly two digital files; two-sided items

• Monographs: “hierarchical”, items related in a hierarchy

• Six-sided views: exactly six digital files (known as “picture cube”)

CONTENTdm Compound Objects

• Remember: metadata fields can be made searchable or not

• In addition, full text, extracted from the digital object itself, can be stored in a metadata field in any of three ways:

1. Generated by OCR “on-the-fly” (integrated ABBYY FineReader®)

2. Imported as .txt transcript

• Typescripted from handwritten manuscripts

or

• OCR’d in advance (external OCR engine)

3. Extracted (by server) from PDFs (only if text was created from the page images when the PDF was made)

Providing searchable text from image files
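For option 2, a quick pre-import check can catch scans that have no transcript. The sketch below is only an illustration: it assumes each transcript is a .txt file sharing its base name with the scan (for example page_001.tif and page_001.txt), and the extension list and the function name `scans_missing_transcripts` are hypothetical, not part of CONTENTdm.

```python
from pathlib import Path

# Assumption: these are the scan formats in use; adjust to your project.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".jp2", ".png"}


def scans_missing_transcripts(scan_dir, transcript_dir=None):
    """Return sorted names of scans that have no same-named .txt transcript.

    Assumption: a transcript shares its base name with the scan it describes.
    If transcript_dir is omitted, transcripts are looked for next to the scans.
    """
    scan_dir = Path(scan_dir)
    transcript_dir = Path(transcript_dir) if transcript_dir else scan_dir
    transcript_stems = {p.stem for p in transcript_dir.glob("*.txt")}
    return sorted(
        p.name
        for p in scan_dir.iterdir()
        if p.is_file()
        and p.suffix.lower() in IMAGE_EXTENSIONS
        and p.stem not in transcript_stems
    )
```

An empty result means every scan in the folder has a matching transcript under this naming assumption.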

Configuration: Collection, Project, & Website

1. Examine the folders and files

2. Access the collection administration page

3. Configure PDF conversion option

4. Add appropriate metadata fields for your data

   1. A searchable field with a Full text search data type

   2. A Tag field that is searchable and hidden.

5. Use metadata templates as much as possible

   1. Page level

   2. Compound object level

   3. PDFs

Configure Collection and Project

• If your materials have searchable text, you will need

Collection field:

One empty, searchable field configured as “Full text search” data type to hold text

• For “top” level records only

• Website Config tool:

• to suppress display of components of compound objects in search results.

• Export via OAI-PMH

Collection and Website Configuration choices
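Since OAI-PMH is a standard protocol, a harvest request can be sketched without any CONTENTdm-specific code. The example below only builds a `ListRecords` request URL using the standard `verb`, `metadataPrefix`, and `set` arguments. The base URL and set name in the usage note are placeholders, not values from this session; check your own site for the real endpoint.

```python
from urllib.parse import urlencode


def list_records_url(base_url, set_spec=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL.

    base_url is the repository's OAI endpoint (placeholder in the example).
    set_spec optionally limits the harvest to one set, such as one collection.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"
```

For example, `list_records_url("https://example.org/oai", "yearbooks")` yields a URL that a harvester could request to retrieve Dublin Core records for that one set.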

Website Config options for PDF display

Adding text-rich materials

PDFs, Images

• Collection configuration option:

• “Convert PDF to compound object”

• What it does and does NOT do.

• How/when you might override it

• Effect on the end users’ view

• A multi-page PDF will call the compound object viewer if it has been processed as if it were a compound object

• A one-page PDF will ignore the setting and call the item viewer to display

Processing a .pdf to optimize indexing, search and display

Remember: PDFs are simple files that can be converted to compound objects, but they are still counted AND added as simple items

• It DOES allow very large PDF files to be indexed, searched and retrieved quickly: EACH page can have up to 128,000 characters.

• It DOES allow end users to search for text across huge volumes of materials.

• It DOES allow the end-user to choose to view the PDF by thumbnail, by contents, with View PDF and text*, or through Page flip.

• It does NOT allow you to “nest” compound objects. I.e., you can assemble multiple PDFs as a compound object, but you cannot then take advantage of the page-level indexing, display etc., within each “page” of the compound object.

What PDF conversion does and does NOT do.
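The per-page figure above lends itself to a simple QA check before import. This is a minimal sketch, assuming you already have the text of each page as a string (for example from transcripts or an external OCR run). The function name is hypothetical, and the sketch says nothing about what CONTENTdm itself does with text beyond the limit.

```python
MAX_CHARS_PER_PAGE = 128_000  # per-page figure quoted on the slide


def pages_over_limit(page_texts, limit=MAX_CHARS_PER_PAGE):
    """Return (page_number, length) for each page whose text exceeds the limit.

    page_texts is a list of strings, one per page, in page order.
    Page numbers in the result are 1-based.
    """
    return [
        (number, len(text))
        for number, text in enumerate(page_texts, start=1)
        if len(text) > limit
    ]
```

A page of exactly 128,000 characters passes; only longer pages are reported.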

PDF importing

Explain and Demonstrate:

• All from one folder

• Several from same folder

Add items, multiple items

Images: Compound Objects

Explain and Demonstrate:

• Definitions:

• Compound Object: series of 2 or more items assembled together

• Wizard: “Add Compound Object”

• Single or Multiple (either Object List or Directory Structure)

• Type of Object (Document, Monograph, Post Card, Picture Cube)

• Method leverages data in hand (with or without tab-delimited metadata)

• Materials commonly assembled as compound objects, e.g.,

• Yearbooks, Papers, Postcards, Books

File organization and naming

Preparing to use the Project Client wizards

• For all compound object types

• Document, Monograph, Postcard, Picture cube

• For each compound object

• All digital files must reside in one directory/folder

• This is true whether you are adding multiple compound objects or a single compound object.

• And with multiple objects, all must be of same type

File and folder facts: regardless of the wizard used in the Add compound object function, SCANS are held together in one folder
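The folder rule and the file counts given earlier for each class (exactly two files for a Postcard, exactly six for a Picture cube, two or more otherwise) can be checked before running a wizard. The sketch below is an illustration under assumptions: it treats every non-.txt, non-hidden file under the object's folder as a scan, it counts subfolders too (so a Monograph structured by folders is covered), and all names in it are hypothetical.

```python
from pathlib import Path

# (minimum, maximum) scans per object; None means no upper bound.
# Counts follow the compound object definitions in this deck.
EXPECTED_COUNTS = {
    "document": (2, None),
    "monograph": (2, None),
    "postcard": (2, 2),
    "picture cube": (6, 6),
}


def count_scans(folder):
    """Count files under folder, skipping .txt files and hidden files."""
    return sum(
        1
        for p in Path(folder).rglob("*")
        if p.is_file()
        and p.suffix.lower() != ".txt"
        and not p.name.startswith(".")
    )


def folder_matches_type(folder, object_type):
    """Return True if the scan count in folder fits the object type."""
    low, high = EXPECTED_COUNTS[object_type]
    n = count_scans(folder)
    return n >= low and (high is None or n <= high)
```

Running `folder_matches_type` over each object folder in a batch gives a quick list of folders to fix before import.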

Example: a single Document using Add compound object

Structure by folder organization

Example: a single Monograph using Add compound object

Where structured by folder organization

Where structured by a tab-delimited text file

When you add multiple compound objects using tab-delimited (tab-d) files: their nature and placement change

Got page-level metadata? Each object needs its own.

Got only object-level metadata? All objects share one.

Compound objects using tab-d files, depending upon the Object class: the structure of the .txt file itself changes

Document: all “columns” are field attributes

Monograph: two new “columns” define structure
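For the Document case, where every column is a field attribute, a tab-delimited file can be produced with Python's standard csv module. This is a minimal sketch: the column names in the usage note are hypothetical examples rather than headers the wizard requires, and the two extra Monograph structure columns are not shown because the slides do not name them.

```python
import csv
import io


def write_tab_delimited(rows, fieldnames):
    """Return tab-delimited text: one header row, then one row per page.

    rows is a list of dicts keyed by the names in fieldnames.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, calling it with `fieldnames=["Title", "Date", "File name"]` and one dict per scan produces text that can be saved as the .txt file. Match the header names to your own collection's field names.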

• Monographs, with two methods for structure and transcripts

• Will you OCR to get transcript?

• Method: directory (folder) structure:

• Yearbook (DEMO only OCR’d transcripts produced on the fly)

• Do you have transcripts in .txt file for each image?

• Method: tab-d file structure

• Book (Separate transcript externally produced)

Demonstrate: (single) Monograph Compound object

Wizard: Compound Object

• Documents

• Using Object List with .txt for each Scan directory

• Typescript Letters with page-level metadata IN .txt files (Horowitz archives)

• Postcards

• Using Directory structure

• Hand-written – some with typescripts, some without

Demonstrate: Compound Objects

Two wizards: each leverages the data

• Administration/Edit—one-at-a-time structure, obj-level metadata

• Project Client/Find in Collection (search or browse)

• Batch or single, all structure and metadata can be edited

• Edit in one or more ways, depending upon the data

• Move, replace, delete, add pages (only ‘true’ compound objects)

• Edit a transcript; find and replace across pages

• Save, upload, re-approve and re-index

Demonstrate: Edit Compound Objects

• Getting help with compound objects

• User Support Center

• Tutorials to study

• Installing, activating the OCR extension

• Help files related to text works

• Office hours twice monthly

• Write to contentdmsupport@oclc.org

Questions & Answers

• Geri Ingram

• ingramg@oclc.org

Questions?