Practices and Open Problems of Document Digitization For Million Book Project Xiaohui Zheng Tsinghua...

Page 1: Practices and Open Problems of Document Digitization For Million Book Project Xiaohui Zheng Tsinghua Univ. Library.

Practices and Open Problems of Document Digitization For

Million Book Project

Xiaohui Zheng

Tsinghua Univ. Library

Page 2:

Background

THU joined the CADAL Project at the end of 2002 and finished 50,000 e-books and e-dissertations in July 2006.

The Digitization Center was founded in March 2003. It is affiliated with the Digital Library Research Division of THU.

Page 3:

Experiences

In house or outsource

Planning and source material selection

Digitization process

Facility and staff management

Page 4:

In house or outsource: in house

Pro:

1. Control over all procedures, the handling of materials, and the quality of products.

2. No risk of working with a vendor who turns out to be incompetent.

Page 5:

In house or outsource: in house

Pro:

3. Provides a foundation of experience that helps with creating policies, analyzing costs, setting standards, and transferring data.

4. Keeping the production line in house lets other digitization projects move forward smoothly within a flexible organization.

Page 6:

In house or outsource: in house

Con:

1. Less experience in staffing and workflow management

2. Low productivity

3. Small scale

Page 7:

In house or outsource: outsource

Pro:

1. Professional staff and a developed workflow

2. High productivity: large output in a short time

3. Large scale

Page 8:

Our Choice

In-house operation

10 staff are enough to finish 50,000 e-books in 3 years

Enough time to train staff and improve efficiency
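The staffing claim above can be checked with simple arithmetic. This is a minimal sketch using only the figures on the slide; the number of working days per year is an assumption for illustration and is not stated in the talk.

```python
# Figures taken from the slide.
BOOKS = 50_000
YEARS = 3
STAFF = 10

# Assumption (not from the slides): about 250 working days per year.
WORKING_DAYS_PER_YEAR = 250

books_per_year = BOOKS / YEARS
books_per_staff_year = books_per_year / STAFF
books_per_staff_day = books_per_staff_year / WORKING_DAYS_PER_YEAR

print(f"books per year:          {books_per_year:,.0f}")
print(f"books per staff per year: {books_per_staff_year:,.0f}")
print(f"books per staff per day:  {books_per_staff_day:.1f}")
```

Under that assumption each staff member would need to finish roughly six to seven books per working day, counting every step from scanning to packaging.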

Page 9:

Source Material Selection

Copyright was the place to start

Easy to handle

Good quality of materials (not fragile)

Quick action in submitting the title list for deduplication
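Deduplicating a title list against titles already claimed by other centers might look like the sketch below. The slides do not describe how CADAL matched titles, so the normalization rule (`normalize_title`) and both function names are hypothetical.

```python
import unicodedata


def normalize_title(title: str) -> str:
    """Hypothetical matching key: Unicode-fold, lowercase, keep letters and digits only."""
    folded = unicodedata.normalize("NFKC", title).casefold()
    return "".join(ch for ch in folded if ch.isalnum())


def deduplicate(candidates, already_listed):
    """Return candidate titles whose key is neither listed already nor repeated."""
    seen = {normalize_title(t) for t in already_listed}
    accepted = []
    for title in candidates:
        key = normalize_title(title)
        if key and key not in seen:
            seen.add(key)
            accepted.append(title)
    return accepted


if __name__ == "__main__":
    ours = ["Dream of the Red Chamber", "dream of the red chamber.", "Journey to the West"]
    theirs = ["Journey To The West"]
    print(deduplicate(ours, theirs))
```

Matching on title alone is crude; a real list would likely also compare author, edition, and year before rejecting a title.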

Page 10:

Digitization Process

Preparation (selection, identifier assignment)

Scanning

Image processing

Metadata creation and packaging

Quality control

Data storage and backup

Page 11:

Ancient book scanning and image processing (double-page, upside-down scanning)

Page 12:

De-speckling and Centering

CADAL production tool: image processing
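The slide names the two operations but not how the CADAL tool performs them. As an illustration only, the sketch below treats a page as a grid of 0/1 values (1 = ink), removes ink pixels that have no ink neighbor, and then shifts the remaining content so its bounding box sits in the middle of the page. The function names and the isolated-pixel rule are assumptions, not the tool's actual algorithm.

```python
def despeckle(img):
    """Clear ink pixels that have no ink among their 8 neighbors."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            # Count ink in the 3x3 window, then subtract the pixel itself.
            neighbors = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            ) - 1
            if neighbors == 0:
                out[y][x] = 0
    return out


def center(img):
    """Move the bounding box of all ink to the middle of the page."""
    h, w = len(img), len(img[0])
    ys = [y for y in range(h) if any(img[y])]
    xs = [x for x in range(w) if any(row[x] for row in img)]
    if not ys:
        return [row[:] for row in img]  # blank page: nothing to center
    top, left = ys[0], xs[0]
    box_h, box_w = ys[-1] - top + 1, xs[-1] - left + 1
    new_top, new_left = (h - box_h) // 2, (w - box_w) // 2
    out = [[0] * w for _ in range(h)]
    for y in range(box_h):
        for x in range(box_w):
            out[new_top + y][new_left + x] = img[top + y][left + x]
    return out


if __name__ == "__main__":
    page = [
        [1, 0, 0, 0, 0, 0],  # lone speck in the corner
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 1, 1],  # real content, off-center
        [0, 0, 0, 0, 1, 1],
    ]
    for row in center(despeckle(page)):
        print(row)
```

De-speckling runs first so that stray specks near the page edge do not stretch the bounding box used for centering.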

Page 13:

Splitting into two pages (Batch processing)
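One plausible way to split a double-page scan is to look for the emptiest column near the middle and cut there. The sketch below assumes a 0/1 grid (1 = ink); `find_gutter`, `split_spread`, and the 20% search band are hypothetical choices, not the behavior of the tool shown on the slide.

```python
def find_gutter(spread, band_percent=20):
    """Return the column with the least ink inside a band around the middle."""
    width = len(spread[0])
    mid = width // 2
    half_band = max(1, width * band_percent // 200)
    lo = max(1, mid - half_band)
    hi = min(width - 1, mid + half_band)

    def ink(x):
        return sum(row[x] for row in spread)

    # Ties are broken in favor of the column closest to the middle.
    return min(range(lo, hi + 1), key=lambda x: (ink(x), abs(x - mid)))


def split_spread(spread):
    """Cut a spread at the gutter; the gutter column goes to the right page."""
    g = find_gutter(spread)
    left = [row[:g] for row in spread]
    right = [row[g:] for row in spread]
    return left, right


if __name__ == "__main__":
    spread = [
        [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    ]
    left, right = split_spread(spread)
    print(find_gutter(spread), left, right)
```

For batch processing, the same function can be mapped over every spread in a volume, for example `[split_spread(s) for s in spreads]`.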

Page 14:

Rotating (Batch processing)
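Pages scanned upside down need a 180-degree turn, and pages scanned sideways need a quarter turn. A minimal sketch on a grid-of-rows image follows; the function names are illustrative, and a production tool would use an imaging library rather than nested lists.

```python
def rotate_180(img):
    """Reverse the row order and each row: a half turn."""
    return [row[::-1] for row in img[::-1]]


def rotate_90_cw(img):
    """Quarter turn clockwise: the bottom row becomes the left column."""
    return [list(col) for col in zip(*img[::-1])]


def rotate_batch(pages, rotate=rotate_180):
    """Apply the same rotation to every page in a volume."""
    return [rotate(page) for page in pages]


if __name__ == "__main__":
    page = [[1, 2],
            [3, 4]]
    print(rotate_180(page))
    print(rotate_90_cw(page))
```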

Page 15: Practices and Open Problems of Document Digitization For Million Book Project Xiaohui Zheng Tsinghua Univ. Library.

De-skewing (batch processing)

TPI
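The slide does not say which de-skewing method the software uses. A common approach is the projection-profile method: try several candidate slopes, undo each one, and keep the slope that concentrates ink into the fewest rows. The sketch below uses a vertical shear in place of a true rotation, which is only a reasonable approximation for small skew angles; all names and the candidate list are assumptions.

```python
def profile_score(img, slope):
    """Sum of squared row ink counts after shearing columns by -slope."""
    h = len(img)
    counts = [0] * h
    for y, row in enumerate(img):
        for x, value in enumerate(row):
            if value:
                ny = y - round(x * slope)
                if 0 <= ny < h:
                    counts[ny] += 1
    # Sharp text lines give a few tall peaks, which maximizes this sum.
    return sum(c * c for c in counts)


def estimate_slope(img, candidates):
    """Pick the candidate slope (rows per column) with the sharpest profile."""
    return max(candidates, key=lambda s: profile_score(img, s))


def deskew(img, slope):
    """Shift every column up by round(x * slope) to level the text lines."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y, row in enumerate(img):
        for x, value in enumerate(row):
            ny = y - round(x * slope)
            if value and 0 <= ny < h:
                out[ny][x] = 1
    return out


if __name__ == "__main__":
    # Synthetic page: one "text line" that drifts down by 0.25 rows per column.
    h, w = 30, 40
    page = [[0] * w for _ in range(h)]
    for x in range(w):
        page[10 + round(0.25 * x)][x] = 1
    slope = estimate_slope(page, [-0.5, -0.25, 0.0, 0.25, 0.5])
    print(slope, sum(deskew(page, slope)[10]))
```

In batch mode the slope would be estimated once per page image, since each page can sit at a different angle on the scanner.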

Page 16:

Format conversion (batch processing)

Page 17:

Metadata creation and packaging

Page 18:

Facility and Staff Management

Facility:

Three flatbed AVA3 AVISION scanners

Two FB6000E AVISION flatbed scanners

Minolta PS 7000

High-speed AVISION AV3800

Staff:

1 manager, 1 technical supervisor, 11 temporary staff

Capacity: 5,000,000 pages/year
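The capacity figure can be put in daily terms and compared with the 50,000-book target. In the sketch below, the working days per year and the average pages per book are assumptions for illustration; neither number appears in the slides.

```python
# Figures from the slides.
PAGES_PER_YEAR = 5_000_000
STAFF = 1 + 1 + 11  # manager + technical supervisor + temporary staff
BOOK_TARGET = 50_000

# Assumptions (not from the slides).
WORKING_DAYS_PER_YEAR = 250
AVG_PAGES_PER_BOOK = 300

pages_per_day = PAGES_PER_YEAR / WORKING_DAYS_PER_YEAR
pages_per_staff_day = pages_per_day / STAFF
years_for_target = BOOK_TARGET * AVG_PAGES_PER_BOOK / PAGES_PER_YEAR

print(f"pages per working day:   {pages_per_day:,.0f}")
print(f"pages per staff per day: {pages_per_staff_day:,.0f}")
print(f"years for {BOOK_TARGET:,} books: {years_for_target:.1f}")
```

With a 300-page average, the stated capacity lines up with finishing 50,000 books in about three years; a different average page count would shift that estimate proportionally.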

Page 19:

Network topology and data storage system

[Diagram. Components shown: WAN; gateway; LAN; Gigabit Ethernet switch; NAS backup system; Dell DAS system; 4 flatbed scanners; high-speed scanner; face-up scanner; 9 manual-processing PCs; 6 automatic-processing PCs.]

Page 20:

Related Software

Scanning: QuickScan…

Image processing: Bookshop, ACDSee, XnView, UltraEdit, Scanfix, DjVuerPro,…

Cataloging and Packaging: CADAL Cataloging Tool, OEBEditor, CMDL Cataloging Toolkit,…

Data transfer: DResManages

Page 21:

Open Problems And Considerations

Content Discovery

Metadata description is rough and inconsistent

Resource Selection

The coverage of the million books is neither clear nor systematic.
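Inconsistent metadata of the kind noted under Content Discovery can be flagged automatically before records are packaged. The sketch below is illustrative only: the required fields and the four-digit-year rule are assumptions, not the CADAL metadata specification.

```python
import re

# Hypothetical rules; the real CADAL schema is not given in the slides.
REQUIRED_FIELDS = ("title", "creator", "date")
YEAR_PATTERN = re.compile(r"^\d{4}$")


def check_record(record):
    """Return a list of problems found in one metadata record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(record.get(field, "")).strip():
            problems.append(f"missing {field}")
    date = str(record.get("date", "")).strip()
    if date and not YEAR_PATTERN.match(date):
        problems.append(f"date not in YYYY form: {date!r}")
    return problems


if __name__ == "__main__":
    records = [
        {"title": "A", "creator": "B", "date": "1936"},
        {"title": "A", "date": "Jul 2006"},
    ]
    for rec in records:
        print(check_record(rec))
```

Running such a check across contributing centers would at least show where descriptions diverge, even if it cannot fix rough cataloging by itself.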

Page 22:

Open Problems And Considerations

OCR Processing

OCR processing has not yet started. OCR technology for ancient books is underdeveloped.

Copyright Problem

Almost 400,000 dissertations and modern books in the CADAL collection do not have a clear copyright disclaimer.

Page 23:

Open Problems And Considerations

Organization Structure

My suggestion: more source collection providers and fewer digitization centers.

Page 24:

Thank you for your attention!