Lucas Mak and Dao Rong Gong Michigan State University Millennium and XML: Repurposing and Customizing Metadata May 17 - 20, 2009


Lucas Mak and Dao Rong Gong

Michigan State University

Millennium and XML: Repurposing and Customizing Metadata

May 17 - 20, 2009

Today’s Outline

Overview of Metadata

Millennium system and XML

Overview of XSLT

Case Studies

1. Sunday School Books Collection

2. New Book List

Conclusions and Observations

Metadata

Structured data or information about an information resource.

Types of metadata:
– Descriptive
– Administrative/Rights
– Preservation
– Technical
– Structural

Descriptive Metadata

Popular descriptive metadata standards
– Dublin Core (Simple & Qualified)
– MODS
– MARCXML
– VRA Core
– IEEE LOM
– TEI Header
– EAD

Innovative XML

XML records from Millennium

Retrieved through HTTP query

Data arrangement based on MARC fields
– But a MARC field and its subfields are siblings

Optimized for WebPAC display
– Brief record (for search result index page display)
  • Contains data from MARC 245, publication year, record ID
– Full record (for both public and staff MARC display of individual record)

Public display

Staff MARC display

Millennium System and XML

[Diagram labels: Millennium; Delimited Text; MARC; XML; /xrecord; XML Server; OAI Harvester; Metadata Builder; Content Pro]

/xrecord

XML Server

XML server query string (search for title “xslt”):

http://magic.msu.edu/xmlopac/?xml=<WXREQ_ROOT><KEY>txslt</KEY></WXREQ_ROOT>
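As a rough illustration of how such a query string could be assembled, here is a minimal Python sketch. The base URL and the WXREQ_ROOT/KEY envelope are copied from the slide. The function names are hypothetical, and two points are assumptions rather than documented behavior: that the leading "t" in the key selects the title index, and that the server accepts a percent-encoded payload.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# Base URL copied from the slide; everything else here is illustrative.
XML_SERVER_BASE = "http://magic.msu.edu/xmlopac/"


def build_request_xml(index_code, term):
    """Wrap an index code plus search term in the WXREQ_ROOT/KEY envelope.

    Assumption: the index code (e.g. "t" for title) is simply prefixed
    to the search term, as in the slide's example key "txslt".
    """
    # escape() protects the envelope if the term contains &, < or >.
    return f"<WXREQ_ROOT><KEY>{index_code}{escape(term)}</KEY></WXREQ_ROOT>"


def build_query_url(index_code, term):
    """Return the full query URL with the XML payload percent-encoded."""
    payload = build_request_xml(index_code, term)
    # safe="" makes quote() encode "/" as well as the angle brackets.
    return f"{XML_SERVER_BASE}?xml={quote(payload, safe='')}"


if __name__ == "__main__":
    print(build_request_xml("t", "xslt"))
    print(build_query_url("t", "xslt"))
```

The sketch only builds the URL and does not send a request, so it can be tried without access to the server.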

OAI Harvester

MetaData Builder

MetaData Builder

Content Pro in Encore

XSLT

Extensible Stylesheet Language Transformations

Current version: 2.0

“Transformation” means:
– Manipulation of XML documents by creating a new document based on the original document

Usages in library context:
– Crosswalking
  • Data selection and manipulation
– Web display
  • Example: converting EAD into HTML for web display

XSLT

Uses XPath expressions to select/filter data nodes
– By name of “Element”
  • <xsl:for-each select="marc:leader">
– By value of “Element” and/or “Attribute”
  • <xsl:for-each select="marc:datafield[@tag=650 and @ind2='0']">
  • <xsl:if test="$leader7='c'">
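The same kinds of selection can be tried outside an XSLT processor. The sketch below uses Python's standard xml.etree.ElementTree, which supports only a limited subset of XPath, on a small invented MARCXML record. It assumes that $leader7 in the slide holds the leader character at position 7 (counting from 0). The sample data and variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map for the MARC21 "slim" namespace.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, for illustration only.
SAMPLE = """<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:leader>00000cam a2200000 a 4500</marc:leader>
  <marc:datafield tag="650" ind1=" " ind2="0">
    <marc:subfield code="a">Sunday schools</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="650" ind1=" " ind2="7">
    <marc:subfield code="a">Religious education</marc:subfield>
  </marc:datafield>
</marc:record>"""

record = ET.fromstring(SAMPLE)

# Select by element name, comparable to select="marc:leader".
leader = record.find("marc:leader", MARC_NS).text

# Select by attribute values, comparable to
# marc:datafield[@tag=650 and @ind2='0'] (chained predicates act as "and").
lcsh_fields = record.findall("marc:datafield[@tag='650'][@ind2='0']", MARC_NS)
lcsh_terms = [
    field.find("marc:subfield[@code='a']", MARC_NS).text for field in lcsh_fields
]

# Conditional test, comparable to <xsl:if test="$leader7='c'">.
# Assumption: $leader7 is the leader character at 0-based position 7.
leader7 = leader[7]
is_collection = leader7 == "c"

if __name__ == "__main__":
    print(lcsh_terms, leader7, is_collection)
```

Only the 650 field with second indicator 0 is selected, which mirrors how the predicate in the slide narrows the node set before any output is produced.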

Case Study One

Sunday School Books Collection
– 19th century publications by religious societies
– 170 titles digitized and cataloged

Data conversion needs
– Source: Millennium
– Target: Content Pro
– Conversions in:
  • Format: .marc to XML
  • Schema and Data Structure: MARC to Qualified Dublin Core

Options for Data Migration

[Diagram labels: Millennium; Create Lists; HTTP Query; MARC File; Innovative XML; MARCEdit; MARCXML; XSLT; Content Pro (QDC)]

Segment of Innovative XML

Siblings

MARC field/subfield as value of element

Field indicator as value of element

Segment of MARC21XML

Parent-Child

MARC field/subfield as value of element attribute

Field indicator as value of element attribute

Segment of MARC21XML

Issues with Innovative XML data conversion needs
– Data structured differently from MARC21XML
  • Availability of existing “Innovative XML to DC/QDC” XSLT?
– Not optimized for data manipulation
  • Complications in data selection
    » Selection of data node by matching criteria against values in individual elements
    » A series of matches may be needed to select just one node
  • Efficiency in processing
    » Multiple upward, downward, and lateral movements involved in data selection
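To make the selection issue concrete, the following Python sketch contrasts the two layouts. The element names in both samples are hypothetical and simplified: they are not the actual Innovative XML or MARC21XML schemas. They only mimic the difference described above, with tag, indicator, and subfield code held as element values in sibling elements versus held as attributes on a parent and its children.

```python
import xml.etree.ElementTree as ET

# Hypothetical "sibling" layout: tag, indicator and subfield code are element values.
SIBLING_STYLE = """<record>
  <field>
    <info><tag>650</tag><ind2>0</ind2></info>
    <subfield><code>a</code><data>Sunday schools</data></subfield>
    <subfield><code>y</code><data>19th century</data></subfield>
  </field>
  <field>
    <info><tag>650</tag><ind2>7</ind2></info>
    <subfield><code>a</code><data>Religious education</data></subfield>
  </field>
</record>"""

# Hypothetical "parent-child" layout: the same facts held as attributes.
ATTRIBUTE_STYLE = """<record>
  <datafield tag="650" ind2="0">
    <subfield code="a">Sunday schools</subfield>
    <subfield code="y">19th century</subfield>
  </datafield>
  <datafield tag="650" ind2="7">
    <subfield code="a">Religious education</subfield>
  </datafield>
</record>"""


def select_sibling_style(root, tag, ind2, code):
    """Match step by step: down into info, back up, then across to each subfield."""
    results = []
    for field in root.iter("field"):
        info = field.find("info")
        if info is None:
            continue
        if info.findtext("tag") != tag or info.findtext("ind2") != ind2:
            continue
        for sub in field.findall("subfield"):
            if sub.findtext("code") == code:
                results.append(sub.findtext("data"))
    return results


def select_attribute_style(root, tag, ind2, code):
    """Express the whole selection as one path with attribute predicates."""
    path = f"datafield[@tag='{tag}'][@ind2='{ind2}']/subfield[@code='{code}']"
    return [node.text for node in root.findall(path)]


if __name__ == "__main__":
    print(select_sibling_style(ET.fromstring(SIBLING_STYLE), "650", "0", "y"))
    print(select_attribute_style(ET.fromstring(ATTRIBUTE_STYLE), "650", "0", "y"))
```

Both functions return the same value for the same request. The sibling layout needs several separate comparisons against element values, while the attribute layout expresses the selection as a single path.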

Final Path of Data Migration

[Diagram labels: Millennium (.marc); Create Lists; MARC File; MARCEdit; MARCXML; XSLT; Content Pro (QDC)]

Design of XSLT

Based on LC’s “MARC To Simple DC” XSLT
– Customized mappings according to LC’s suggestions
– Crosswalking strategies
  • Conditional processing (i.e. matching)
    » boolean(), contains(), starts-with()
    » <xsl:if>, <xsl:choose>, <xsl:when>
  • String manipulation
    » Used in both conditional processing and data selection for output
    » substring(), substring-before(), substring-after(), translate(), concat(), normalize-space()

Design of XSLT

Conditional Processing & String Manipulation in De-duplication

<xsl:for-each select="marc:datafield[@tag=246]/marc:subfield[@code='a']">
  <xsl:if test="not(contains($dataField245Lower,
      translate(substring(normalize-space(.),1,string-length()-1),
      $upperCase,$lowerCase)))">
    <xsl:element name="dcterms:alternative">
      <xsl:value-of select="normalize-space(substring(.,1,string-length()-1))"/>
    </xsl:element>
  </xsl:if>
</xsl:for-each>

Converts 245 & 246 into lower case before comparing

Chop trailing period (.)

Compare MARC 246 against MARC 245
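The de-duplication logic can be sketched in Python to show the three steps side by side: lowercase both titles, chop the trailing period, and keep a 246 title only if it is not already contained in the 245 title. This is an illustration, not the presenters' code. It assumes $dataField245Lower holds the lowercased 245 title, and it differs from the slide in one respect: it removes the last character only when that character is a period, whereas the slide's substring() call drops the last character unconditionally. The sample titles are invented.

```python
def normalize_space(text):
    """Mimic XPath normalize-space(): trim and collapse runs of whitespace."""
    return " ".join(text.split())


def chop_trailing_period(text):
    """Drop one trailing period, if there is one."""
    return text[:-1] if text.endswith(".") else text


def alternative_titles(title_245, titles_246):
    """Return the 246 titles that are not already contained in the 245 title."""
    title_lower = normalize_space(title_245).lower()
    kept = []
    for raw in titles_246:
        candidate = chop_trailing_period(normalize_space(raw))
        # Case-insensitive containment test, comparable to
        # not(contains($dataField245Lower, translate(...))).
        if candidate.lower() not in title_lower:
            kept.append(candidate)
    return kept


if __name__ == "__main__":
    # Invented sample titles, for illustration only.
    main_title = "The child's book of nature : for the use of families and schools."
    variants = ["Child's book of nature.", "Nature lessons for the young"]
    print(alternative_titles(main_title, variants))
```

In this invented example the first variant is already part of the main title, so only the second would be emitted as an alternative title.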

Design of XSLT

No <dcterms:alternative> for MARC 246

Design of XSLT

Predicate

• Used for data selection and de-duplication

<!-- Output MARC 650y as <dcterms:temporal> -->
<xsl:for-each select="marc:datafield[@tag=650 and @ind2='0']
    [not(marc:subfield[@code='y'] = preceding-sibling::marc:datafield[@tag=650 and @ind2='0']/marc:subfield[@code='y'])]
    /marc:subfield[@code='y']">
  <xsl:element name="dcterms:temporal">
    <xsl:value-of select="normalize-space(self::node())"/>
  </xsl:element>
</xsl:for-each>

Selects LCSH only

Selects unique 650$y only
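A comparable selection can be sketched in Python with a "seen" set. Note that this swaps in a different technique from the slide's predicate. The XPath version drops a whole 650 field when any of its $y values matches a $y in a preceding LCSH 650 field, and it compares raw values. The sketch below de-duplicates value by value after whitespace normalization. The function name and sample record are invented for illustration.

```python
import xml.etree.ElementTree as ET

MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Invented sample record, for illustration only.
SAMPLE = """<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:datafield tag="650" ind1=" " ind2="0">
    <marc:subfield code="a">Sunday schools</marc:subfield>
    <marc:subfield code="y">19th century.</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="650" ind1=" " ind2="0">
    <marc:subfield code="a">Religious education</marc:subfield>
    <marc:subfield code="y">19th century.</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="650" ind1=" " ind2="7">
    <marc:subfield code="a">Children</marc:subfield>
    <marc:subfield code="y">20th century.</marc:subfield>
  </marc:datafield>
  <marc:datafield tag="650" ind1=" " ind2="0">
    <marc:subfield code="a">Tracts</marc:subfield>
    <marc:subfield code="y">18th century.</marc:subfield>
  </marc:datafield>
</marc:record>"""


def unique_temporal_terms(record):
    """Collect distinct 650$y values from LCSH headings (ind2='0'), in document order."""
    path = "marc:datafield[@tag='650'][@ind2='0']/marc:subfield[@code='y']"
    seen = set()
    terms = []
    for sub in record.findall(path, MARC_NS):
        value = " ".join((sub.text or "").split())  # like normalize-space()
        if value and value not in seen:
            seen.add(value)
            terms.append(value)
    return terms


if __name__ == "__main__":
    print(unique_temporal_terms(ET.fromstring(SAMPLE)))
```

The field with second indicator 7 is skipped by the attribute predicates, and the repeated period subdivision is emitted only once.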

Design of XSLT

Hard-coding

Inserted elements that are global to all records

<!-- Output <dc:format>application/pdf</dc:format> -->
<xsl:element name="dc:format">
  <xsl:text>application/pdf</xsl:text>
</xsl:element>
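For comparison, the same hard-coding idea can be sketched in Python: a fixed element is appended to every output record regardless of the source data. The function name and the bare "record" wrapper element are hypothetical. The namespace URI is the standard Dublin Core elements namespace.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# Ask ElementTree to serialize this namespace with the "dc" prefix.
ET.register_namespace("dc", DC_NS)


def add_fixed_format(record, mime_type="application/pdf"):
    """Append a dc:format element whose value is the same for every record."""
    fmt = ET.SubElement(record, f"{{{DC_NS}}}format")
    fmt.text = mime_type
    return fmt


if __name__ == "__main__":
    # "record" is a placeholder wrapper, not a real Content Pro element name.
    record = ET.Element("record")
    add_fixed_format(record)
    print(ET.tostring(record, encoding="unicode"))
```

Keeping the constant as a default argument makes it easy to reuse the same routine for a collection whose files are in a different format.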

Segment of Source MARCXML

Segment of Output QDC XML

Case Study Two

Library’s book lists

Issues with featured list

Existing New Book List
– Newly cataloged books for browse shelf
– New approach using XML and XSLT

New features design
– Sorting
– RSS feed
– Customization

Case Study Two

New Book List Based on XML File

Millennium XML server outputs two files
– Entire new book list over a rolling period of time
– List of daily added books

New Book List program output
– Book List in HTML format
– RSS feed for daily added books
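The two outputs can be sketched in a few lines of Python: sort a list of records for the HTML list, and wrap the daily additions in an RSS 2.0 feed. The presenters' pipeline used XSLT rather than Python, so this is only an illustration of the idea. The record fields, titles, IDs, feed URL, and function names are all placeholders.

```python
import xml.etree.ElementTree as ET

# Placeholder records; a real list would be read from the XML server output.
DAILY_BOOKS = [
    {"title": "Sample Title C", "record_id": "id-003"},
    {"title": "sample title A", "record_id": "id-001"},
    {"title": "Sample Title B", "record_id": "id-002"},
]


def sort_books(books, key="title"):
    """Return the books sorted case-insensitively on the chosen field."""
    return sorted(books, key=lambda book: book[key].lower())


def build_rss(books, feed_title, feed_link):
    """Build a minimal RSS 2.0 document with one item per book."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Books added today"
    for book in books:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = book["title"]
        # Hypothetical link pattern; a real feed would point at the catalog record.
        ET.SubElement(item, "link").text = f"{feed_link}?record={book['record_id']}"
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    ordered = sort_books(DAILY_BOOKS)
    print([book["title"] for book in ordered])
    print(build_rss(ordered, "New Book List", "http://example.org/newbooks"))
```

Sorting before building the feed keeps the HTML list and the RSS items in the same order, which is one simple way to keep the two outputs consistent.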

Path of Data Processing

[Diagram labels: Millennium; Expect; XML output; XSLT; Web Server & PHP; Internet]

Design of XSLT

Design of XSLT

Design of XSLT

Putting It Together

Putting It Together

Observations and Challenges

Millennium System and XML
– XSLT processor within Millennium and customizing Innovative XML output

Using XML as data source
– Large XML file size

XSLT and data processing
– XSLT data manipulation
– Lack of built-in functions for conditional data looping etc.