
Workshop on Metadata Standards and Best Practices
November 19-20th, 2007

Session 4
The Data Documentation Initiative

Technical Overview

Pascal Heus

Open Data Foundation

[email protected]

http://www.opendatafoundation.org


http://www.opendatafoundation.org Open Data Foundation – IZA 2007/11

Outline

• XML refresher
• DDI Background
• DDI 1/2.x
  – Status / Tools
• DDI 3.0
  – Use cases
  – Need for tools
• Conclusions / Q&A

Thanks to the DDI Alliance and GESIS for slides on DDI


XML to the rescue!

• XML is driving today’s web service oriented architecture of the Internet and Intranets
• Using XML, we can capture, structure, transform, discover, exchange, query, edit and secure metadata and data
• XML is platform & language independent and can be used by everyone
• XML is both machine and human readable
• XML is non-proprietary, public domain and many open tools exist
• Domain specific standards are available!


XML Technical Overview

• Structure: DTD, XSchema
• Transform: XSL, XSLT, XSL-FO
• Discover: Registries, Databases
• Exchange: Web Services, SOAP, REST
• Search: XPath, XQuery
• Manage: Software, XForms
• Capture: XML
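
Some of the roles listed above (capture, search, transform) can be tried with a standard library alone. The sketch below uses Python's xml.etree.ElementTree, which supports only a limited subset of XPath. The element names (`survey`, `variable`) are invented for illustration and do not belong to any standard, and the dictionary step is only a stand-in for a real XSLT transform.

```python
import xml.etree.ElementTree as ET

# Capture: a small, made-up metadata document (element names are illustrative only).
DOC = """
<survey id="S1">
  <title>Household Survey 2007</title>
  <variable name="age" type="numeric">Age in years</variable>
  <variable name="sex" type="coded">Sex of respondent</variable>
</survey>
"""

root = ET.fromstring(DOC)

# Search: ElementTree supports a limited XPath subset, enough for simple queries.
coded_names = [v.get("name") for v in root.findall("./variable[@type='coded']")]

# Transform: reshape the metadata into a plain dictionary (a stand-in for XSLT).
labels = {v.get("name"): v.text for v in root.iter("variable")}

print(coded_names)
print(labels)
```

Full XPath, XQuery and XSLT need a dedicated processor; the point here is only that the same document serves all three roles.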


What is DDI?

• The Data Documentation Initiative (DDI) is an XML format for capturing metadata about survey data and register data
• Data files may remain in their native formats (ASCII files which may be delimited or fixed-width) or may be captured as XML
• It was originally designed to describe codebooks, and was mainly useful for data archives and libraries
  – Versions 1.*/2.*
• Now, it can be used for any type of data collection
  – Version 3.0
  – Focus on survey instrumentation and microdata, but can also describe aggregates


Background

• Concept of DDI and definition of needs grew out of the data archival community
• Established in 1995 as a grant funded project initiated and organized by ICPSR
• Members:
  – Social Science Data Archives (US, Canada, Europe)
  – Statistical data producers (including US Bureau of the Census, the US Bureau of Labor Statistics, Statistics Canada and Health Canada)
• February 2003 – Formation of DDI Alliance
  – Membership-based alliance
  – Formalized development procedures


Growth of the DDI Structure

• 2000 – DDI 1.0
  – Simple survey
  – Archival data formats
  – Microdata only
• 2003 – DDI 2.0
  – Aggregate data (based on matrix structure)
  – Added geographic material to aid geographic search systems and GIS users


Characteristics of DDI 1.0/2.0

• Focuses on the static object of a codebook
• Designed for limited uses
  – End user data discovery via the variable or high level study identification (bibliographic)
  – Only heavily structured content relates to information used to drive statistical analysis
• Coverage is focused on single study, single data file, simple survey and aggregate data files
• Variable contains majority of information (question, categories, data typing, physical storage information, statistics)


1/2.x Schema

• Organized in 5 sections
  – docDscr: information about the XML document itself: metadata preparation, version, etc.
  – stdyDscr: detailed information about the survey
    • Title, year, coverage, sampling, data collection/cleaning, quality, contact, access policy, etc.
  – fileDscr: describes files in the dataset
  – dataDscr: describes the data structure
    • Variable: name, label, code, summary statistics, definitions, literal question, interviewer instructions, weights, grouping, etc.
    • Cubes: aggregated data
  – otherMat: additional documentation
• See examples
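
To make the section layout above concrete, here is a minimal sketch that reads a trimmed codebook with Python's standard library. The fragment is illustrative only: it has not been validated against the DDI DTD, it leaves out most content a real codebook carries, and the titles, file name and variable are made up. The `summarize` helper is hypothetical and not part of any DDI tool.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative fragment in the spirit of a DDI 1/2.x codebook.
# It is NOT validated against the DTD and omits most real content.
CODEBOOK = """
<codeBook>
  <docDscr><citation><titlStmt><titl>Documentation of the Example Survey</titl></titlStmt></citation></docDscr>
  <stdyDscr><citation><titlStmt><titl>Example Survey 2007</titl></titlStmt></citation></stdyDscr>
  <fileDscr><fileTxt><fileName>example.dat</fileName></fileTxt></fileDscr>
  <dataDscr>
    <var name="SEX">
      <labl>Sex of respondent</labl>
      <catgry><catValu>1</catValu><labl>Male</labl></catgry>
      <catgry><catValu>2</catValu><labl>Female</labl></catgry>
    </var>
  </dataDscr>
</codeBook>
"""

def summarize(xml_text):
    """Return the study title and a dict of variables with their code lists."""
    root = ET.fromstring(xml_text)
    study_title = root.findtext("stdyDscr/citation/titlStmt/titl")
    variables = {}
    for var in root.findall("dataDscr/var"):
        # Each category pairs a stored value with its human-readable label.
        codes = {c.findtext("catValu"): c.findtext("labl") for c in var.findall("catgry")}
        variables[var.get("name")] = {"label": var.findtext("labl"), "codes": codes}
    return study_title, variables

title, variables = summarize(CODEBOOK)
print(title)
print(variables)
```

Note how nearly everything of interest hangs off the variable, which matches the variable-centric character of DDI 1/2.x.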


Limitations

• Treated as an “add on” to the data collection process
• Focus is on the data end product and end users (static)
• Limited tools for creation or exploitation
• The variable must exist before metadata can be created
• Producers hesitant to take up DDI creation because it is a cost and does not support their development or collection process


Requirements for 3.0

• Improve and expand the machine-actionable aspects of the DDI to support programming and software systems

• Support CAI instruments through expanded description of the questionnaire (content and question flow)

• Support the description of data series (longitudinal surveys, panel studies, recurring waves, etc.)

• Support comparison, in particular comparison by design but also comparison after the fact (harmonization)

• Improve support for describing complex data files (record and file linkages)

• Provide improved support for geographic content to facilitate linking to geographic files (shape files, boundary files, etc.)


Approach

• Shift from the codebook-centric model of early versions of DDI to a lifecycle model
  – providing metadata support from data study conception through analysis and repurposing of data
• Shift from an XML Document Type Definition (DTD) to an XML Schema model
  – to support the lifecycle model, reuse of content and increased controls to support programming needs
• Redefine a “single DDI instance” to include a “simple instance”, similar to DDI 1.*/2.*, which covered a single study, and “complex instances” covering groups of related studies
  – Allow a single study description to contain multiple data products (for example, a microdata file and aggregate products created from the same data collection)
• Incorporate the requested functionality in the first published edition


Designing to support registries

• Resource package
  – structure to publish non-study-specific materials for reuse
• Extracting specified types of information into schemes
  – Universe, Concept, Category, Code, Question, Instrument, Variable
• Allowing for either internal or external references
• Providing comparison mapping
  – Target can be external harmonized structure


Relationship to Other Standards

• Dublin Core
  – Mapping of citation elements
  – Option for DC namespace basic entry
• ISO 19115 – Geography
  – Search requirements
  – Support for GIS users
• METS
  – Designed to support profile development
• OAIS
  – Reference model for the archival lifecycle
• SDMX
  – Complete mapping to and from DDI NCubes
  – Designed to be used with registries
• ISO/IEC 11179
  – Variable linking representation to concept and universe
  – Optional data element construct in ConceptualComponent that allows for complete ISO/IEC 11179 structure as a maintained item


Development of DDI 3.0

• 2004 – Acceptance of a new DDI paradigm
  – Lifecycle model
  – Shift from the codebook centric / variable centric model to capturing the lifecycle of data
  – Agreement on expanded areas of coverage
• 2005
  – Presentation of schema structure
  – Focus on points of metadata creation and reuse
• 2006
  – Presentation of first complete 3.0 model
  – Internal and public review
• 2007
  – Vote to move to Candidate Version
  – Establishment of a set of use cases to test application and implementation
• 2008
  – March: anticipated vote to publish DDI 3.0


XML Schemas and 3.0 Modules (one is not necessarily the other)

• XML Schemas
  – Each .xsd file is an XML schema
  – Some XML schemas are modules
  – Some XML schemas are substitution sets or “sub-modules”
  – Some XML schemas simply contain elements that are used by multiple schemas or may require more frequent updates
  – Some XML schemas are “external”


XML Schemas and 3.0 Modules (one is not necessarily the other)

• Modules
  – Reflect closely related sets of information similar to the sections of DDI 1.*/2.* DTD
  – Modules can be held as separate XML instances and be included in a large instance by either inclusion or reference
  – All modules are maintainable, identifiable packages
  – Each module has its own XML namespace
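
Because each module lives in its own XML namespace, software reading an instance has to address elements by namespace as well as by name. The sketch below shows the mechanics with Python's standard library. The namespace URIs are placeholders chosen for this example and are not the official DDI 3.0 namespaces; the element names and identifier are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs: assumptions for this sketch, not the official DDI 3.0 ones.
NS = {
    "i": "urn:example:ddi:instance",
    "s": "urn:example:ddi:studyunit",
    "r": "urn:example:ddi:reusable",
}

# One instance that mixes elements from three modules, one namespace per module.
INSTANCE = """
<i:DDIInstance xmlns:i="urn:example:ddi:instance"
               xmlns:s="urn:example:ddi:studyunit"
               xmlns:r="urn:example:ddi:reusable">
  <s:StudyUnit id="SU1">
    <r:Title>Example Survey 2007</r:Title>
  </s:StudyUnit>
</i:DDIInstance>
"""

root = ET.fromstring(INSTANCE)

# The prefix map lets queries name the module each element comes from.
study = root.find("s:StudyUnit", NS)
title = study.findtext("r:Title", namespaces=NS)

# ElementTree stores tags as "{namespace}localname"; recover the module namespace.
module_of_root = root.tag.split("}")[0].lstrip("{")

print(study.get("id"), title, module_of_root)
```

The prefixes in the query map need not match those in the document; only the namespace URIs have to agree.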


XML SCHEMAS

• archive
• comparative
• conceptualcomponent
• datacollection
• dataset
• dcelements
• DDIprofile
• ddi-xhtml11
• ddi-xhtml11-model-1
• ddi-xhtml11-modules-1
• group
• inline_ncube_recordlayout
• instance
• logicalproduct
• ncube_recordlayout
• organization
• physicaldataproduct
• physicalinstance
• reusable
• simpledc20021212
• studyunit
• tabular_ncube_recordlayout
• xml
• set of xml schemas to support xhtml


Basic Structure/Organization

• DDI 3.0 is divided into “modules”
• Each contains a set of related metadata
• Reusable metadata is divided into schemes
• Modules reflect the steps of the data lifecycle


DDI 3.0 Modules

• Main modules are:
  – Study Unit (contains a simple study description)
  – Conceptual Component
  – Data Collection (survey instruments, questions, sources)
  – Logical Product (concepts, variables, codes, categories)
  – Physical Storage (describes patterns of storage and physical instances/files)
  – Archive (organizations and processing events)
  – Group (comparing and grouping study units)
  – Comparative (allows for explicit comparisons between grouped studies)
• See also http://www.ddialliance.org/DDI/ddi3/module-descriptions.html


Maintainable Schemes (that’s with an ‘e’ not an ‘a’)

• Concept Scheme
• Universe Scheme
• Question Scheme
• Control Construct Scheme
• Category Scheme
• Code Scheme
• Variable Scheme

Packages of reusable metadata maintained by a single agency
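
The idea of a maintained, reusable scheme can be sketched in a few lines of Python. This is a hypothetical in-memory model, not the DDI 3.0 data model: the class names, the `agency:id:version` key format and the `example.org` agency are all assumptions made for the illustration. It shows two variables pointing at one code scheme by reference instead of each carrying its own copy.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class CodeScheme:
    """A reusable package of codes maintained by one agency (hypothetical model)."""
    agency: str
    scheme_id: str
    version: str
    codes: tuple  # pairs of (value, category label)

@dataclass
class Variable:
    name: str
    label: str
    code_scheme_ref: str  # a reference by identifier, not a copy of the codes

@dataclass
class Registry:
    schemes: dict = field(default_factory=dict)

    def register(self, scheme):
        # The key format is an assumption of this sketch.
        key = f"{scheme.agency}:{scheme.scheme_id}:{scheme.version}"
        self.schemes[key] = scheme
        return key

    def resolve(self, variable):
        # Follow the reference and return the value-to-label mapping.
        return dict(self.schemes[variable.code_scheme_ref].codes)

registry = Registry()
yes_no = registry.register(CodeScheme("example.org", "YesNo", "1.0", ((1, "Yes"), (2, "No"))))

employed = Variable("EMPLOYED", "Currently employed", yes_no)
student = Variable("STUDENT", "Currently a student", yes_no)

print(yes_no)
print(registry.resolve(employed) == registry.resolve(student))
```

Carrying the agency and version in the reference is what allows a scheme to be maintained in one place and reused across studies.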


DDI 3.0

• Look at schema (Candidate release 2)
• Look at examples (prototype XML)


DDI Lifecycle View and Use Cases


Our Initial Thinking…

The metadata payload from version 2.* DDI was re-organized to cover these areas.


Wrapper

For later parts of the lifecycle, metadata is reused heavily from earlier modules.

The discovery and analysis itself creates data and metadata, re-used in future cycles.


Realizations

• Many different organizations and individuals are involved throughout this process
  – This places an emphasis on versioning and exchange between different systems
• There is potentially a huge amount of metadata reuse throughout an iterative cycle
  – We needed to make the metadata as reusable as possible
• Every organization acts as an “archive” (that is, a maintainer and disseminator) at some point in the lifecycle
  – When we say “archive” in DDI 3.0, it refers to this function


DDI 3.0 Lifecycle Model

Metadata Reuse


Use Cases

• Study design/survey instrumentation
• Questionnaire generation/data collection and processing
• Data recoding, aggregation and other processing
• Data dissemination/discovery
• Archival ingestion/metadata value-add
• Question/concept/variable banks
• DDI for use within a research project
• Capture of metadata regarding data use
• Metadata mining for comparison, etc.
• Generating instruction packages/presentations


Study Design/Survey Instrumentation

• This use case concerns how DDI 3.0 can support the design of studies and survey instrumentation
  – Without benefit of a question or concept bank


Questionnaire Generation, Data Collection, and Processing

• This use case concerns how DDI 3.0 can support the creation of various types of questionnaires/CAI, and the collection and processing of raw data into microdata.

• Algenta working on DDI 3.0 based software


Data Recoding, Aggregation, etc.

• This use case concerns how DDI 3.0 can describe recodes, aggregation, and similar types of data processing.

• Relevant to both producer and researcher
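
What "describing" a recode means can be illustrated with a small Python sketch in which the recode is held as data and then applied. This is not DDI 3.0 syntax: the dictionary layout, the variable names `AGE` and `AGEGRP`, the age bands and the sample records are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical recode description: source variable, target variable, value ranges.
RECODE = {
    "source": "AGE",
    "target": "AGEGRP",
    "rules": [(0, 17, "child"), (18, 64, "adult"), (65, 120, "senior")],
}

def apply_recode(record, recode):
    """Return a copy of the record with the derived variable added."""
    value = record[recode["source"]]
    for low, high, label in recode["rules"]:
        if low <= value <= high:
            return {**record, recode["target"]: label}
    # Values outside every range are left uncoded.
    return {**record, recode["target"]: None}

# Made-up microdata records.
microdata = [{"AGE": 12}, {"AGE": 30}, {"AGE": 45}, {"AGE": 70}]

recoded = [apply_recode(r, RECODE) for r in microdata]

# Aggregation: count records per derived category.
aggregate = Counter(r["AGEGRP"] for r in recoded)

print(dict(aggregate))
```

Keeping the rule as data rather than as code buried in a script is what lets both the producer and a later researcher see how a derived variable was built.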


Data Dissemination/Data Discovery

• This use case concerns how DDI 3.0 can support the discovery and dissemination of data.

• Highly relevant to researchers


Archival Ingestion and Metadata Value-Add

• This use case concerns how DDI 3.0 can support the ingest and migration functions of data archives and data libraries.


Question/Concept/Variable Banks

• This use case describes how DDI 3.0 can support question, concept, and variable banks. These are often termed “registries” or “metadata repositories” because they contain only metadata – links to the data are optional, but provide implied comparability. The focus is metadata reuse.

• Concept classification very important to researchers


DDI For Use within a Research Project

• This use case concerns how DDI 3.0 can support various functions within a research project, from the conception of the study through collection and publication of the resulting data.

• Direct use in RDC


Capture of Metadata Regarding Data Use

• This use case concerns how DDI 3.0 can capture information about how researchers use data, which can then be added to the overall metadata set about the data sources they have accessed.

• Data use and user feedback crucial to improve overall quality and future data production (relevance)


Metadata Mining for Comparison, etc.

• This use case concerns how collections of DDI 3.0 metadata can act as a resource to be explored, providing further insight into the comparability and other features of a collection of data.


Generating Instruction Packages/Presentations

• This use case concerns how DDI 3.0 can support automation around the instruction of students and others.


Tools

• DDI 1/2.x
  – IHSN Microdata Management Toolkit (http://www.surveynetwork.org/toolkit)
  – Nesstar (http://www.nesstar.com)
  – http://www.ddialliance.org
  – Dextris (http://opendatafoundation.org)
• DDI 3.0
  – Foundation Tools Platform
  – UKDA DExT
  – DDI 3.0 Use case http://opendatafoundation.org/ddi/use_cases.php
  – Algenta SurveyWiz
  – Dextris (http://opendatafoundation.org)


DDI for RDC

• A small set of DDI 1/2.x tools is available today
  – Users can generate from internal databases, use conversion utilities (see DDI web site) or software like Nesstar Publisher and the IHSN Microdata Management Toolkit
• DDI 3.0 has a much broader scope and provides both core and advanced functionalities that will require management tools
  – Next generation metadata framework is being built as the standard is being finalized
  – The DDI Foundation Tools Program is an umbrella for an implementers' startup toolkit


Conclusions

• The first generation of DDI is suitable for data archives interested in the preservation of metadata and discovery by users

• DDI 3.0 focuses on the entire life cycle of the survey and is suitable for many different uses
  – More relevant to RDC environment

• DDI 3.0 calls for coordinated efforts for building relevant tools for producers, archives, researchers and other users