1

panFMP - An XML-based Framework for Metadata Portals

Talk and "hands-on" seminar at GFZ Potsdam

Uwe Schindler, MARUM – University of Bremen

PANGAEA® - Publishing Network for Geoscientific & Environmental Data


2

Metadata Portals: Search Technology for distributed Catalogues

• Searching directly on distributed catalogues: In distributed search infrastructures, every data provider has not only its own metadata catalogue, but also a corresponding search interface to the portal (e.g., web-service based). Search requests are sent to all data providers. The portal only needs to collect the search results from the providers, then rank them and display them to the end user. Examples: NSDI Clearinghouse, GeoMIS.BUND

• Harvesting catalogues into a central searchable catalogue: Every data provider has its own metadata catalogue, but the search engine is centralized. The portal periodically harvests all metadata records into a central index and serves search requests from there. Major web search engines like Google and the FGDC-related Geospatial One-Stop are based on this concept. Response times are short because only local components are involved in the search process.

3

Metadata Portals: Harvesting solutions from PANGAEA®

• WDC-MARE with its information system PANGAEA® currently provides data portals for several EU/international projects:

• Not all data are stored centrally, so all datasets provided in portals must be consolidated from different sources!

• Features:
– Data stays at the data providers
– Metadata is harvested by the portal
– Search queries are handled by the centralized catalogue (Google-like search speed!)
– Scientist gets link to data at the provider

4

Metadata Harvesting Solutions

• Web Accessible Folder (WAF): Harvesting by recursively collecting XML files from a web server's directory listing; simple, but inefficient

• Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
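The Web Accessible Folder approach from the first bullet can be sketched in a few lines. This is only an illustration, not panFMP code: the function `harvest_waf`, the `fetch` callback and the in-memory `PAGES` dictionary (standing in for a real web server at the made-up address `http://example.org/waf/`) are all assumptions for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a directory listing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def harvest_waf(base_url, fetch):
    """Return the URLs of all .xml files below base_url.

    fetch(url) must return the HTML of the directory listing at url.
    """
    found, todo, seen = [], [base_url], set()
    while todo:
        url = todo.pop()
        if url in seen:
            continue
        seen.add(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            target = urljoin(url, href)
            if not target.startswith(url):
                continue  # skip parent-directory and external links
            if target.endswith("/"):
                todo.append(target)  # sub-directory: visit it later
            elif target.endswith(".xml"):
                found.append(target)
    return sorted(found)


# Hypothetical directory listings, used instead of real HTTP requests.
PAGES = {
    "http://example.org/waf/":
        '<a href="../">Parent</a><a href="a.xml">a.xml</a><a href="sub/">sub/</a>',
    "http://example.org/waf/sub/":
        '<a href="../">Parent</a><a href="b.xml">b.xml</a><a href="readme.txt">readme</a>',
}

print(harvest_waf("http://example.org/waf/", PAGES.__getitem__))
```

In this sketch every run walks all directories again, because a plain directory listing offers no way to ask only for changed records; that is one plausible reading of "inefficient" above.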

5

Open Archives Protocol

• The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol developed by the Open Archives Initiative.

• Almost all digital libraries support it (most famous ones: Fedora Commons, arXiv and the CERN Document Server; GeoNetwork Opensource)

• Portals by Scientific Commons, OAIster, SUB

• uses it during web crawling (if available)

• Very simple to implement (XML over HTTP-REST)

• Repository software for databases or file system metadata providers is widely available (e.g. DLESE jOAI software on the data provider side)
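To illustrate the "XML over HTTP" point, the following sketch builds an OAI-PMH `ListRecords` request and reads record identifiers and the resumption token from a response. The verb, the `metadataPrefix` and `resumptionToken` arguments and the namespace come from the OAI-PMH 2.0 specification; the endpoint `http://example.org/oai`, the helper names and the sample response are made up for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords request URL.

    A resumption token is an exclusive argument: when it is given,
    metadataPrefix must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in
           root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)]
    token = root.findtext(".//oai:resumptionToken", default="",
                          namespaces=OAI_NS)
    return ids, (token.strip() or None)


# Hypothetical, heavily shortened response of a data provider.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

print(list_records_url("http://example.org/oai", metadata_prefix="oai_dc"))
ids, token = parse_list_records(SAMPLE_RESPONSE)
print(ids, token)
print(list_records_url("http://example.org/oai", resumption_token=token))
```

A harvester would repeat the request with each returned token until the response carries no token any more.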

6

Current OAI-PMH software

1. Limited to Dublin Core metadata (libraries)!

2. Limited full text search functionality due to relational databases in the background!

3. No geographic retrievals (because of Dublin Core limitation)!

4. End user interface is part of the software; this limits usability in content management systems.

7

Central indexing requirements

1. Open for any XML metadata format

2. Any mappings to document fields should be done by XPath/XSLT

3. Possibility to map incompatible XML schemas during harvesting by XSLT on-the-fly

4. On-the-fly validation of (maybe previously transformed) documents during harvesting

5. No relational database, only a full text search engine that contains everything needed for operation

6. Range queries on specific fields (date/time or numeric)

7. Web service interface / programming API for the end user interface that is accessible from any language (Java/JSP, PHP, Perl,...)
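Requirement 2 (field mappings by XPath) might look like the sketch below. It is not the panFMP configuration format: the `FIELDS` table, the function `extract_fields` and the sample record are assumptions, and Python's `xml.etree.ElementTree` only supports a subset of XPath, which is sufficient for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: field name -> (XPath expression, value type).
FIELDS = {
    "title": ("./title", str),
    "latitude": ("./latitude", float),
    "longitude": ("./longitude", float),
}


def extract_fields(xml_text, fields=FIELDS):
    """Evaluate every configured XPath and return {field: [values]}."""
    root = ET.fromstring(xml_text)
    doc = {}
    for name, (xpath, convert) in fields.items():
        values = [convert(e.text) for e in root.findall(xpath) if e.text]
        if values:
            doc[name] = values
    return doc


RECORD = """
<metadata>
  <title>Carbon and oxygen isotope ratios</title>
  <latitude>74.1</latitude>
  <longitude>11.0</longitude>
</metadata>
"""

doc = extract_fields(RECORD)
print(doc)
# Typed numeric fields are what makes range queries (requirement 6) possible:
print(60.0 <= doc["latitude"][0] <= 80.0)
```

Keeping the mapping in a table rather than in code means a new metadata format only needs a new set of XPath expressions.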

8

• Ranked searching - best results returned first

• Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries for date/time values and numbers

• Fielded searching. All fields are searchable as a whole, each field separately (e.g. for author, parameter), or mixed.

• Any combination of boolean operators between search terms (AND, OR, NOT, exact phrase)

• Sorting by any field

• Multiple-index searching with merged results

• Simultaneous searching and updates due to high-performance indexing

9

Structure of a Lucene Index

Documents (ID and stored document contents)

ID 1:

<metadata>
<title>Carbon and oxygen isotope ratios</title>
<latitude>74.1</latitude>
<longitude>11.0</longitude>
</metadata>

ID 2:

<metadata>
<title>Stable oxygen and carbon isotope composition</title>
<latitude>63.9</latitude>
<longitude>11.0</longitude>
</metadata>

ID 3:

<metadata>
<title>Carbon and oxygen in benthic foraminifera</title>
<latitude>74.1</latitude>
<longitude>12.3</longitude>
</metadata>

Inverted Index (terms)

Field        Text token      Document IDs
title        benthic         3
title        carbon          1, 2, 3
title        composition     2
title        foraminifera    3
title        isotope         1, 2
title        oxygen          1, 2, 3
title        ratios          1
title        stable          2
latitude     63.9            2
latitude     74.1            1, 3
longitude    11.0            1, 2
longitude    12.3            3
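The inverted index on this slide can be reproduced with a short sketch. It does not use Lucene: the tokenizer, the stop-word set ("and" and "in", chosen because these words are absent from the slide's term table) and the function names are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

DOCS = {
    1: "<metadata><title>Carbon and oxygen isotope ratios</title>"
       "<latitude>74.1</latitude><longitude>11.0</longitude></metadata>",
    2: "<metadata><title>Stable oxygen and carbon isotope composition</title>"
       "<latitude>63.9</latitude><longitude>11.0</longitude></metadata>",
    3: "<metadata><title>Carbon and oxygen in benthic foraminifera</title>"
       "<latitude>74.1</latitude><longitude>12.3</longitude></metadata>",
}

STOP_WORDS = {"and", "in"}   # assumption, see lead-in
TOKENIZED_FIELDS = {"title"}  # other fields are indexed as a single token


def build_index(docs):
    """Map (field, token) -> sorted list of document IDs."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for field in ET.fromstring(docs[doc_id]):
            if field.tag in TOKENIZED_FIELDS:
                tokens = [t for t in re.findall(r"\w+", field.text.lower())
                          if t not in STOP_WORDS]
            else:
                tokens = [field.text]
            for token in dict.fromkeys(tokens):  # de-duplicate, keep order
                index[(field.tag, token)].append(doc_id)
    return index


def search_all(index, *terms):
    """Boolean AND: IDs of documents that contain every (field, token)."""
    sets = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


index = build_index(DOCS)
for (field, token), ids in sorted(index.items()):
    print(field, token, ids)
print(search_all(index, ("title", "isotope"), ("longitude", "11.0")))
```

A query never scans the stored documents; it looks up the posting list of each term and combines the lists, which is why the search cost is driven by the terms rather than by the document contents.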

10

panFMP – PANGAEA® Framework for Metadata Portals

panFMP is a generic and flexible framework for building geoscientific metadata portals independent of content standards for metadata and protocols. Data providers can be harvested with commonly used protocols (e.g., Open Archives Initiative Protocol for Metadata Harvesting) and metadata standards like Dublin Core, DIF, or ISO 19115. The new Java-based portal software supports any XML encoding and makes metadata searchable through Apache Lucene. Software administrators are free to define searchable fields independent of their type using XPath and/or XSL Templates. In addition, by extending the full-text search engine (FTS) Apache Lucene, we have significantly improved queries for numerical and date/time ranges by supplying a new trie-based algorithm, thus enabling high-performance space/time retrievals in FTS-based geo portals. The harvested metadata are stored in separate indexes, which makes it possible to combine these into different portals. The portal-specific Java API and web service interface is highly flexible and supports custom front-ends for users, provides automatic query completion (AJAX), and dynamic visualization with conventional mapping tools.

11

panFMP – Components of a metadata portal

[Component diagram; elements shown:]

• User

• Metadata Portal: User Frontend

• panFMP - Framework for Metadata Portals: Search API (SOAP or native Java), Harvester, RangeQuery, IndexBuilder, Apache Lucene, Configuration (entity)

• Datastore: Lucene indexes in file system

• Metadata Provider 1: OAI-PMH Provider

• Metadata Provider 2: Web Server

• Connections labelled: OAI-PMH, HTTP

12

panFMP - Harvesting

[Activity diagram; elements shown:]

• Sources: Data Provider (two), each read by an OAI-PMH Harvester; File System, read by a Directory Harvester

• Index Builder steps: accept document as DOM tree; transform by XSL; validate against schema; apply XPath (once per field); serialize DOM to an XML blob; add document to index

• Targets: Lucene Index (three), Virtual Index (two), Search Interface
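The harvesting steps named in the diagram can be chained as in the following sketch. It is only a stand-in for the real index builder: Python's standard library has no XSLT processor or schema validator, so the `transform` and `validate` arguments are plain callables, and the in-memory list replaces a Lucene index. All names are hypothetical.

```python
import xml.etree.ElementTree as ET


def harvest_document(xml_text, index, xpaths, transform=None, validate=None):
    """Run one document through the pipeline; return True if it was indexed."""
    dom = ET.fromstring(xml_text)                    # accept document as DOM tree
    if transform is not None:
        dom = transform(dom)                         # stand-in for "transform by XSL"
    if validate is not None and not validate(dom):   # stand-in for schema validation
        return False
    fields = {name: [e.text for e in dom.findall(path)]  # apply XPath per field
              for name, path in xpaths.items()}
    blob = ET.tostring(dom, encoding="unicode")      # serialize DOM to an XML blob
    index.append({"fields": fields, "blob": blob})   # add document to index
    return True


def rename_root(dom):
    """Toy transformation: map a foreign root element to <metadata>."""
    dom.tag = "metadata"
    return dom


def has_title(dom):
    """Toy validation: require a <metadata> root with a <title> child."""
    return dom.tag == "metadata" and dom.find("title") is not None


index = []
ok = harvest_document(
    "<record><title>Carbon and oxygen isotope ratios</title></record>",
    index, {"title": "./title"}, rename_root, has_title)
rejected = harvest_document("<record/>", index, {"title": "./title"},
                            rename_root, has_title)
print(ok, rejected, index)
```

Validating after the transformation means a provider with an incompatible schema can still be accepted, as long as the transformed document is valid.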

13

panFMP - Search Interface

• Supports all standard Lucene search features

• Additional support for fast range queries to enable bounding boxes, etc.:
– implemented by redundant storage of "numerical terms" in different precisions
– recursive reduction of distinct terms (every numerical value is a term) on range query
– search time no longer dependent on index size

• Accessible via Java API or AXIS web service

14

panFMP – Range Queries

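[Figure: trie of indexed terms in three precisions]

Precision 1: 4, 5, 6

Precision 2: 42, 44, 52, 63, 64

Precision 3: 421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644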

Example of trie-based recursive splitting of a range query with three precisions (simplified for demonstration): The user wants to find all records with terms between "423" and "642". Instead of selecting all terms in the lowermost row, the query is optimized to match terms with lower precision wherever possible. It is enough to select term "5" to match all records starting with "5" ("521", "522"), or "44" for "445", "446", "448". The query is therefore simplified to match all records containing the terms "423", "44", "5", "63", "641", or "642".
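The example can be replayed with a small sketch that follows the slide's simplification (decimal digit prefixes as precisions). It is not the algorithm as implemented in panFMP or Lucene; the names `precision_terms` and `split_range` are made up for this illustration.

```python
def precision_terms(value, width=3):
    """Index side: store a value redundantly in every precision.

    423 -> ["4", "42", "423"]
    """
    digits = str(value).zfill(width)
    return [digits[:n] for n in range(1, width + 1)]


def split_range(terms, lo, hi, width=3):
    """Query side: pick the fewest prefixes that cover all terms in [lo, hi].

    terms, lo and hi are fixed-width digit strings, so string comparison
    agrees with numeric comparison.
    """
    terms = sorted(set(terms))
    selected = []

    def visit(prefix):
        pad = width - len(prefix)
        low, high = prefix + "0" * pad, prefix + "9" * pad
        if high < lo or low > hi:
            return  # this branch lies completely outside the range
        if prefix and lo <= low and high <= hi:
            selected.append(prefix)  # whole branch inside: one term is enough
            return
        # Branch only partly inside: descend to the next precision.
        children = sorted({t[:len(prefix) + 1]
                           for t in terms if t.startswith(prefix)})
        for child in children:
            visit(child)

    visit("")
    return selected


TERMS = ["421", "423", "445", "446", "448", "521", "522",
         "632", "633", "634", "641", "642", "644"]

print(precision_terms(423))
print(split_range(TERMS, "423", "642"))
```

Only the two edges of the range are resolved at full precision; everything in between is covered by a few low-precision terms, so the number of terms in the query stays small however many distinct values the index holds.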

15

Examples

• http://sedis.iodp.org

• http://www.c3grid.de/portal

• http://www.world-data-centers.org/

• http://dataportal.carboocean.org

• http://pages-dataportal.unibe.ch/cgi-bin/WebObjects/dataportal

• Currently not available: http://data.planktonnet.eu

16

Thank You!

Software available open source on Sourceforge.net!

http://www.panFMP.org

http://sourceforge.net/projects/panfmp