CONTENTdm - OCLC
Geri Ingram
Community Manager
OCLC CONTENTdm®
Working with Text and PDFs
Spring 2015 CONTENTdm User Conference
Goucher College
Baltimore MD
May 27, 2015
The world’s libraries. Connected.
• To get the most out of this session
• Either you have:
• Experience building CONTENTdm collections
OR
• Attended recent CONTENTdm Training
Intended audience
• Context-setting
• Data
• File types
• Organization
• Naming
• Collection Configuration
• Adding text-rich digital items
• PDFs, singly and in batch
• Image files: Monographs (singly), Docs, Postcards singly and in batch
• QA
Agenda
Context
Your Users
Your Collections
Discovery & Delivery
• Primary user stories—who are your users?
• What are the text-rich materials you have?
• YOUR users' needs drive decisions about content and access methods
• YOUR data drive decisions about collection building methods
• Which tools are appropriate for your materials?
• What do different wizards expect by way of file naming and organization?
Context: Your users, Your materials
What are your users looking for?
Yearbooks, newspapers, ETDs…
Historical postcards…
Archival papers…
• Browsing
• Searching--across collections, subgroups?
• Known item searching, and/or
• Total recall by topic, name, etc.
• Your users expect searchable, text-rich materials
• Full-text searchability across the repository, the site, the world.
How are they looking for these materials?
• Metadata is key for discovery
• All fields may be made searchable
• Full-text is quickly becoming necessary for discovery as well as delivery
• Linked data will eventually provide the engine for new knowledge creation
Curated AND indexed
Meet users where they are
• Simple guidelines to help
• Website Config tools
• Primary URL
• Automated site maps
• Moving toward linked data
• Built from standard vocabularies
• Search engines
• Sitemaps, persistent identifiers, etc.
• Schema.org
• Becoming visible in the web-of-things
• Harvesting metadata
• WorldCat
• DPLA
Visibility
Data:
File types
Organization
Naming
• Papers, videos, audio files
• In CONTENTdm, these are natively simple items, not compound objects, e.g.:
• .mp3
• .avi
Formats: Born-digital
• If still to be digitized
• You have control over the project specification
• File name and organization
• Metadata automatically and manually created
• If already digitized
• You choose among the tools for the one that best fits your data organization
Formats: Digitized (reformatted)
• De facto standard for documents
• Can have embedded text
• Portable
• Responsive design must include a PDF reader for market demands
• Preservation metadata formats
• A simple format; it can be ingested through CONTENTdm Web add as well as through Project Client.
Adobe PDF
• CONTENTdm-defined classes, used when 2 or more simple items are bound together by logic (and XML):
• Documents: “flat”, a series of related items
• Postcards: exactly two digital files; two-sided items
• Monographs: “hierarchical”, items related in a hierarchy
• Six-sided views: exactly six digital files (known as “picture cube”)
CONTENTdm Compound Objects
• Remember: metadata fields can be made searchable or not
• In addition, full text extracted from the digital object itself can be stored in a metadata field in any of three ways:
1. Generated by OCR “on-the-fly” (integrated ABBYY FineReader®)
2. Imported as .txt transcript
• Typescripted from handwritten manuscripts
or
• OCR’d in advance (external OCR engine)
3. Extracted (by server) from PDFs (if text has been created from the image to begin with)
Providing searchable text from image files
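When transcripts are imported as .txt files (option 2 above), it can help to confirm before import that every scan has a matching transcript. The following is a minimal sketch, not a CONTENTdm tool: it assumes each transcript shares its image's base file name, and the folder layout, function name, and extension list are illustrative assumptions.

```python
from pathlib import Path

# Assumed image extensions; adjust to match your own scans.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".jpg", ".jpeg", ".jp2", ".png"}


def find_missing_transcripts(scan_dir, transcript_dir):
    """Return names of image files with no same-named .txt in transcript_dir."""
    scan_dir = Path(scan_dir)
    transcript_dir = Path(transcript_dir)
    # Base names (without extension) of the transcripts that exist.
    transcript_stems = {p.stem for p in transcript_dir.glob("*.txt")}
    missing = []
    for image in sorted(scan_dir.iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_EXTENSIONS:
            if image.stem not in transcript_stems:
                missing.append(image.name)
    return missing
```

For example, `find_missing_transcripts("scans", "scans/transcripts")` would list the scans that still need a transcript or an OCR pass; both paths here are placeholders.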
Configuration:
Collection, Project, & Website
1. Examine the folders and files
2. Access the collection administration page
3. Configure PDF conversion option
4. Add appropriate metadata fields for your data
   1. A searchable field with a Full text search data type
   2. A Tag field that is searchable and hidden.
5. Use metadata templates as much as possible
   1. Page level
   2. Compound object level
   3. PDFs
Configure Collection and Project
• If your materials have searchable text, you will need
Collection field:
One empty, searchable field configured as “Full text search” data type to hold text
• For “top” level records only
• Website Config tool:
• to suppress display of components of compound objects in search results.
• Export via OAI-PMH
Collection and Website Configuration choices
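To make the OAI-PMH export concrete, here is a minimal sketch of how a harvester might build a request URL. The `ListRecords` verb and the `oai_dc` metadata prefix come from the OAI-PMH protocol itself; the base URL and set name in the usage note are placeholders, not actual CONTENTdm endpoints, and the function name is illustrative.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url, set_spec=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optional: restrict the harvest to one set (for example, one collection).
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"
```

Calling `build_list_records_url("https://example.org/oai", "yearbooks")` yields a URL a harvester could request; substitute your own site's OAI base URL and set name.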
Website Config options for PDF display
Adding text-rich materials
PDFs
Images
• Collection configuration option:
• “Convert PDF to compound object”
• What it does and does NOT do.
• How/when you might override it
• Effect on the end-users’ view
• A multi-page PDF will call the compound object viewer if it has been processed as if it were a compound object
• A one-page PDF will ignore the setting and call the item viewer to display
Processing a .pdf to optimize indexing, search and display
Remember: PDFs are simple files that can be converted to compound objects, but they are still counted AND added as simple items
• It DOES allow very large PDF files to be indexed, searched and retrieved quickly: EACH page can have 128,000 characters.
• It DOES allow end users to search for text across huge volumes of materials.
• It DOES allow the end-user to choose to view the PDF by thumbnail, by contents, with View PDF and text*, or through Page flip.
• It does NOT allow you to “nest” compound objects. I.e., you can assemble multiple PDFs as a compound object, but you cannot then take advantage of the page-level indexing, display etc., within each “page” of the compound object.
What PDF conversion does and does NOT do.
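The 128,000-character figure can be checked ahead of time if you already have each page's text. The sketch below assumes the figure is an upper bound per page and that page text has been extracted separately into a list of strings; the function name is illustrative, not part of CONTENTdm.

```python
# Per-page character figure quoted on the slide, treated here as an upper bound.
PAGE_CHAR_LIMIT = 128_000


def pages_over_limit(page_texts, limit=PAGE_CHAR_LIMIT):
    """Return 1-based page numbers whose text is longer than the limit."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text) > limit
    ]
```

An empty result means every page's text fits within the assumed limit.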
PDF importing
Explain and Demonstrate:
• All from one folder
• Several from same folder
Add items, multiple items
Images:
Compound Objects
Explain and Demonstrate:
• Definitions:
• Compound Object: series of 2 or more items assembled together
• Wizard: “Add Compound Object”
• Single or Multiple (either Object List or Directory Structure)
• Type of Object (Document, Monograph, Post Card, Picture Cube)
• Method leverages data in hand (with or without tab-delimited metadata)
• Materials commonly assembled as compound objects, e.g.,
• Yearbooks, Papers, Postcards, Books
File organization and naming
Preparing to use the Project Client wizards
• For all compound object types
• Document, Monograph, Postcard, Picture cube
• For each compound object
• All digital files must reside in one directory/folder
• This is true whether you are adding multiple compound objects or a single compound object.
• And with multiple objects, all must be of the same type
File and folder facts: regardless of the wizard used in the Add compound object function, SCANS are held together in one folder
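Before running a wizard, a quick inventory of the folders can confirm that each object's scans really sit together. This is a minimal sketch under two assumptions that are not stated on the slide: one subfolder per compound object under a parent folder, and page order following file-name order. The function name is illustrative.

```python
from pathlib import Path


def inventory_objects(parent_dir):
    """Map each object folder name to the sorted names of files directly inside it."""
    inventory = {}
    for folder in sorted(Path(parent_dir).iterdir()):
        if folder.is_dir():
            # Only files directly in the folder are listed; subfolders are ignored.
            inventory[folder.name] = sorted(
                p.name for p in folder.iterdir() if p.is_file()
            )
    return inventory
```

An object folder that maps to an empty list, or whose file names do not sort into the intended page order, is worth fixing before import.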
Example: a single Document using Add compound object
Structure by folder organization
Example: a single Monograph using Add compound object
Where structured by folder organization
Where structured by a tab-delimited text file
When you add multiple compound objects using tab-d files:
Their nature and placement change
Got page-level metadata?
Each object needs its own.
Got only object-level metadata?
All objects share one.
Compound objects using tab-d files, depending upon the Object class:
The structure of the .txt file itself changes
Document: all “columns” are field attributes
Monograph: two new “columns” define structure
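The difference between the two file shapes can be sketched by generating both. The slide does not name the columns, so everything here is illustrative: the field names, the sample rows, and especially the two structure columns, "Structure level" and "Section", are hypothetical stand-ins rather than CONTENTdm's actual headings.

```python
import csv
import io


def to_tab_delimited(fieldnames, rows):
    """Serialize rows (a list of dicts) as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Document: every column is a metadata field (names are illustrative).
document_fields = ["Title", "Date", "File name"]
document_text = to_tab_delimited(
    document_fields,
    [
        {"Title": "Page 1", "Date": "1950", "File name": "letter_001.tif"},
        {"Title": "Page 2", "Date": "1950", "File name": "letter_002.tif"},
    ],
)

# Monograph: the same fields plus two columns that carry the hierarchy.
# "Structure level" and "Section" are HYPOTHETICAL names for illustration.
monograph_fields = ["Structure level", "Section"] + document_fields
monograph_text = to_tab_delimited(
    monograph_fields,
    [
        {"Structure level": "1", "Section": "Chapter 1", "Title": "Page 1",
         "Date": "1950", "File name": "book_001.tif"},
        {"Structure level": "1", "Section": "Chapter 2", "Title": "Page 2",
         "Date": "1950", "File name": "book_002.tif"},
    ],
)
```

Printing `document_text` and `monograph_text` side by side shows the point of the slide: the Document file is fields only, while the Monograph file adds columns whose job is structure rather than description.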
• Monographs, with two methods for structure and transcripts
• Will you OCR to get transcript?
• Method: directory (folder) structure:
• Yearbook (DEMO only OCR’d transcripts produced on the fly)
• Do you have transcripts in .txt file for each image?
• Method: tab-d file structure
• Book (Separate transcript externally produced)
Demonstrate: (single) Monograph Compound object
Wizard: Compound Object
• Documents
• Using Object List with .txt for each Scan directory
• Typescript Letters with page-level metadata IN .txt files (Horowitz archives)
• Postcards
• Using Directory structure
• Hand-written – some with typescripts, some without
Demonstrate: Compound Objects
Two wizards: each leverages the data
• Administration/Edit: one-at-a-time structure, object-level metadata
• Project Client/Find in Collection (search or browse)
• Batch or single, all structure and metadata can be edited
• Edit in one or more ways, depending upon the data
• Move, replace, delete, add pages (only ‘true’ compound objects)
• Edit a transcript; find and replace across pages
• Save, upload, re-approve and re-index
Demonstrate: Edit Compound Objects
• Getting help with compound objects
• User Support Center
• Tutorials to study
• Installing, activating the OCR extension
• Help files related to text works
• Office hours twice monthly
• Write [email protected]
Questions & Answers
• Geri Ingram
Questions?