Data Integration, Analysis, and Synthesis

Data Integration, Analysis, and Synthesis. Matthew B. Jones, National Center for Ecological Analysis and Synthesis, University of California Santa Barbara. Scalable Information Networks for the Environment. http://knb.ecoinformatics.org Funding: National Science Foundation (DEB99-80154, DBI99-04777)


Transcript of Data Integration, Analysis, and Synthesis

Page 1: Data Integration, Analysis, and Synthesis

Data Integration, Analysis, and Synthesis

Matthew B. Jones

National Center for Ecological Analysis and Synthesis

University of California Santa Barbara

Scalable Information Networks for the Environment

http://knb.ecoinformatics.org

Funding: National Science Foundation (DEB99-80154, DBI99-04777)

Page 2: Data Integration, Analysis, and Synthesis

NCEAS’ Mission

Integrate existing data for broad ecological synthesis

Use synthesis to inform policy and management

Page 3: Data Integration, Analysis, and Synthesis

Synthesis at NCEAS

Research, Management, Policy

200+ synthesis projects

1900+ participating scientists

Page 4: Data Integration, Analysis, and Synthesis

Research projects

Hunsaker – Quantification of Uncertainty in Spatial Data for Ecological Applications
Ives & Frost – Intrinsic and Extrinsic Variability in Community Dynamics
Osenberg – Meta-Analysis, Interaction Strength and Effect Size; Application of Biological Models to the Synthesis of Experimental Data
Murdoch – Complex Population Dynamics

Page 5: Data Integration, Analysis, and Synthesis

Management projects

Andelman – Designing and Assessing the Viability of Nature Reserve Systems at Regional Scales: Integration of Optimization, Heuristic and Dynamic Models
Boersma & Kareiva – Prospectus for an Analysis of Recovery Plans and Delisting
Kareiva – Habitat Conservation Planning for Endangered Species
Lubchenco, Palumbi, & Gaines – Developing the Theory of Marine Reserves

Page 6: Data Integration, Analysis, and Synthesis

Policy projects

Costanza & Farber – The Value of the World's Ecosystem Services and Natural Capital: Toward a Dynamic, Integrated Approach

http://www.nceas.ucsb.edu/

Page 7: Data Integration, Analysis, and Synthesis

Synthesis projects

Use existing data:
Distributed sources
Varying protocols
Varying formats

Data obtained via personal collaboration

Page 8: Data Integration, Analysis, and Synthesis

Functional breakdown for synthesis

Data discovery
Data access
Data storage
Data interpretation
Quality assessment
Data conversion & integration
Analysis & modeling
Visualization

Page 9: Data Integration, Analysis, and Synthesis

Presentation Outline

Integration, Analysis, and Synthesis:

Challenges

Page 10: Data Integration, Analysis, and Synthesis

Data Heterogeneity

Population survey
Experimental
Taxonomic survey
Behavioral
Meteorological
Oceanographic
Hydrology
Economic
Social (urban ecology)
Paleoecological
Historical
Land use
Demographics
…

Page 11: Data Integration, Analysis, and Synthesis

Types of Heterogeneity

Intensional vs. arbitrary heterogeneity

Syntax (format): CSV, fixed ASCII, proprietary binary
Schema (organization): non-normalized models
Semantics (meaning/methods): protocol semantics (e.g., scale); parameter semantics (e.g., body size (g)); conceptual framework (e.g., experimental treatments); taxonomy + nomenclature
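As an illustration of syntactic heterogeneity only, the sketch below reads the same observation from a CSV file and from a fixed-width ASCII file. The column names, the sample record, and the fixed-width column positions are hypothetical and are not taken from any data set in the talk; in practice the positions would have to come from the data set's documentation.

```python
import csv
import io

# Hypothetical record in two syntaxes (same content, different format).
CSV_TEXT = "date,site,species,count\n10/01/1993,N654,PIRU,26\n"
FIXED_TEXT = "10/01/1993N654PIRU  26\n"

# Assumed fixed-width layout: (field name, start column, end column).
FIXED_LAYOUT = [("date", 0, 10), ("site", 10, 14), ("species", 14, 18), ("count", 18, 22)]


def parse_csv(text):
    """Parse delimited text whose first line names the columns."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rec["count"] = int(rec["count"])
        rows.append(dict(rec))
    return rows


def parse_fixed(text, layout=FIXED_LAYOUT):
    """Parse fixed-width text using externally supplied column positions."""
    rows = []
    for line in text.splitlines():
        rec = {name: line[start:end].strip() for name, start, end in layout}
        rec["count"] = int(rec["count"])
        rows.append(rec)
    return rows


if __name__ == "__main__":
    print(parse_csv(CSV_TEXT))
    print(parse_fixed(FIXED_TEXT))
```

The CSV file carries its own column names, while the fixed-width file is unreadable without the layout, which is one reason format descriptions belong in metadata.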

Page 12: Data Integration, Analysis, and Synthesis

Data Dispersion

Data are distributed among:
Independent researcher holdings
Research station collections: LTER Network (24 sites); Org. of Biological Field Stations (168 sites); Univ. Cal Natural Reserve System (36 sites); MARINE (62 sites); PISCO
Agency databases
Museum databases

Access is via personal networking, which is not scalable

Page 13: Data Integration, Analysis, and Synthesis

Lack of Metadata

Majority of ecological data are undocumented
Lack information on syntax, schema, and semantics of data
Impossible to understand data without contacting the original researchers

Documentation conventions vary widely
Requires a large time investment to understand each data set

Page 14: Data Integration, Analysis, and Synthesis

Scaling Data Integration

Because of:
Data heterogeneity
Data dispersion
Lack of documentation

Integration and synthesis are limited to a manual process
Thus, it is difficult to scale integration efforts up to large numbers of data sets

Page 15: Data Integration, Analysis, and Synthesis

Data Integration

A
Date        Site  Species  Area  Count
10/1/1993   N654  PIRU     2     26
10/3/1994   N654  PIRU     2     29
10/1/1993   N654  BEPA     1     3

B
Date        Site  picrub  betpap
31Oct1993   1     13.5    1.6
14Nov1994   1     8.4     1.8

C
Date        Site  Species            Density
10/1/1993   N654  Picea rubens       13
10/3/1994   N654  Picea rubens       14.5
10/1/1993   N654  Betula papyrifera  3
10/31/1993  1     Picea rubens       13.5
10/31/1993  1     Betula papyrifera  1.6
11/14/1994  1     Picea rubens       8.4
11/14/1994  1     Betula papyrifera  1.8
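The mechanical steps implied by the three tables on this slide can be sketched as follows. This is a minimal sketch under stated assumptions, not the KNB integration engine: it assumes that Density in the integrated table equals Count divided by Area (which matches the rows shown, e.g. 26 / 2 = 13), that the wide table already holds densities, and that the species codes map to the scientific names listed in the integrated table. The lookup table and function names are hypothetical.

```python
from datetime import datetime

# Hypothetical code-to-name lookup; in practice this would come from metadata.
SPECIES_NAMES = {
    "PIRU": "Picea rubens", "picrub": "Picea rubens",
    "BEPA": "Betula papyrifera", "betpap": "Betula papyrifera",
}

# Long format: one row per species, raw count plus sampled area.
TABLE_A = [
    {"Date": "10/1/1993", "Site": "N654", "Species": "PIRU", "Area": 2, "Count": 26},
    {"Date": "10/3/1994", "Site": "N654", "Species": "PIRU", "Area": 2, "Count": 29},
    {"Date": "10/1/1993", "Site": "N654", "Species": "BEPA", "Area": 1, "Count": 3},
]

# Wide format: one density column per species.
TABLE_B = [
    {"Date": "31Oct1993", "Site": "1", "picrub": 13.5, "betpap": 1.6},
    {"Date": "14Nov1994", "Site": "1", "picrub": 8.4, "betpap": 1.8},
]


def normalize_date(text, fmt):
    """Convert a date string in the given format to M/D/YYYY."""
    d = datetime.strptime(text, fmt)
    return f"{d.month}/{d.day}/{d.year}"


def integrate(table_a, table_b):
    """Return rows with the common schema Date, Site, Species, Density."""
    rows = []
    for r in table_a:
        rows.append({
            "Date": normalize_date(r["Date"], "%m/%d/%Y"),
            "Site": r["Site"],
            "Species": SPECIES_NAMES[r["Species"]],
            "Density": r["Count"] / r["Area"],  # assumed conversion
        })
    for r in table_b:
        # Unpivot: each species column becomes its own row.
        for code in ("picrub", "betpap"):
            rows.append({
                "Date": normalize_date(r["Date"], "%d%b%Y"),
                "Site": r["Site"],
                "Species": SPECIES_NAMES[code],
                "Density": r[code],
            })
    return rows


if __name__ == "__main__":
    for row in integrate(TABLE_A, TABLE_B):
        print(row)
```

Every hard-coded choice here (date formats, the code lookup, the count-to-density conversion) is information that would otherwise have to be obtained from the original researcher, which is the case for recording it as metadata.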

Page 16: Data Integration, Analysis, and Synthesis

Presentation Outline

Integration, Analysis, and Synthesis:

Challenges
Current work: Knowledge Network for Biocomplexity; Partnership for Biodiversity Informatics

Page 17: Data Integration, Analysis, and Synthesis

Knowledge Network for Biocomplexity (KNB)

National network for biocomplexity data:
Data discovery
Data access
Data interpretation

Enable advanced services:
Data integration
Analysis framework
Hypothesis modeling
Visualization

Page 18: Data Integration, Analysis, and Synthesis

Central Role of Metadata

What metadata? Ownership, attribution, structure, contents, methods, quality, etc.

Critical for addressing data heterogeneity issues

Critical for developing extensible systems

Critical for long-term data preservation

Allows advanced services to be built

Page 19: Data Integration, Analysis, and Synthesis

KNB Components

Ecological Metadata Language (EML)
Morpho -- data management for ecologists (cross-platform Java application)
Metacat -- flexible metadata & data system
Analysis and Modeling Engine
Data Integration Engine
Semantic Query Processor
Hypothesis Modeling Engine

Page 20: Data Integration, Analysis, and Synthesis

Ecological Metadata Language

XML syntax for representing metadata

Extensible – can add new metadata

Modular – can subset metadata for specific applications

Page 21: Data Integration, Analysis, and Synthesis

EML 2.0beta3 modules

eml-resource -- Basic resource info
eml-dataset -- Data set info
eml-literature -- Citation info
eml-software -- Software info
eml-party -- People and organizations
eml-entity -- Data entity (table) info
eml-attribute -- Attribute (variable) info
eml-constraint -- Integrity constraints
eml-physical -- Physical format info
eml-access -- Access control
eml-distribution -- Distribution info
eml-project -- Research project info
eml-coverage -- Geographic, temporal, and taxonomic coverage
eml-protocol -- Methods and QA/QC

Page 22: Data Integration, Analysis, and Synthesis
Page 23: Data Integration, Analysis, and Synthesis

Metacat metadata system

[Diagram. Labels: LTER Metacat, NCEAS Metacat, SDSC Metacat, NRS Metacat, SEV Metacat; Morpho clients; Web clients; sites AND, SEV, CAP, OBFS. Key: Metacat Catalog, Site metadata system, XML wrapper.]

Page 24: Data Integration, Analysis, and Synthesis

Metacat architecture

[Diagram. Labels:]
HTTP Server (Apache)
Java Servlet Engine (Tomcat)
Metacat Server
Metacat Servlet (Dispatcher)
Transformation Subsystem
Storage Subsystem
Query Subsystem
Replication Subsystem
Validation Subsystem
Authentication Interface
LDAP Adapter
LDAP
JDBC API
RDBMS (Oracle)
Data Storage Interface
FS Adapter
File System

Page 25: Data Integration, Analysis, and Synthesis

Metacat web interface

Page 26: Data Integration, Analysis, and Synthesis

UC Natural Reserve System

OBFS Network

LTER Network

Page 27: Data Integration, Analysis, and Synthesis

Functional breakdown for synthesis

Data discovery
Data access
Data storage
Data interpretation
Quality assessment
Data conversion & integration
Analysis & modeling
Visualization

Page 28: Data Integration, Analysis, and Synthesis

Quality Assessment system

[Diagram: Semantic Metadata + Data + Researcher Decisions, producing a Quality Assessment Report]

Page 29: Data Integration, Analysis, and Synthesis

Quality Assessment

Integrity constraint checking
Data type checking
Metadata completeness
Data entry errors
Outlier detection
Check assertions about data: e.g., trees don’t shrink; e.g., sea urchins do
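One way to check an assertion such as "trees don't shrink" is to flag any individual whose measured size decreases between consecutive observations. The sketch below is illustrative only: the field names (`tree`, `year`, `dbh_cm`) and the sample measurements are hypothetical, and whether the check applies at all is a per-taxon decision (it would be wrong for sea urchins).

```python
def find_decreases(records, id_field, time_field, value_field):
    """Return (previous, current) record pairs where the value decreased.

    Records are grouped by id_field and compared in time_field order.
    """
    violations = []
    previous = {}
    ordered = sorted(records, key=lambda r: (r[id_field], r[time_field]))
    for rec in ordered:
        prev = previous.get(rec[id_field])
        if prev is not None and rec[value_field] < prev[value_field]:
            violations.append((prev, rec))
        previous[rec[id_field]] = rec
    return violations


# Hypothetical stem-diameter measurements.
MEASUREMENTS = [
    {"tree": "T1", "year": 1993, "dbh_cm": 21.4},
    {"tree": "T1", "year": 1994, "dbh_cm": 21.9},
    {"tree": "T2", "year": 1993, "dbh_cm": 35.0},
    {"tree": "T2", "year": 1994, "dbh_cm": 3.52},  # possibly a misplaced decimal point
]

if __name__ == "__main__":
    for before, after in find_decreases(MEASUREMENTS, "tree", "year", "dbh_cm"):
        print(f"{after['tree']}: {before['dbh_cm']} -> {after['dbh_cm']}")
```

A flagged pair is a candidate data entry error to review, not proof of one; the decision stays with the researcher.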

Page 30: Data Integration, Analysis, and Synthesis

Data Integration

[Diagram: Semantic Metadata + Data + Researcher Decisions, producing the Integrated Data Set]

Integrated Data Set
Date        Site  Species            Density
10/1/1993   N654  Picea rubens       13
10/3/1994   N654  Picea rubens       14.5
10/1/1993   N654  Betula papyrifera  3
10/31/1993  1     Picea rubens       13.5
10/31/1993  1     Betula papyrifera  1.6
11/14/1994  1     Picea rubens       8.4
11/14/1994  1     Betula papyrifera  1.8

Page 31: Data Integration, Analysis, and Synthesis

Data Integration

A
Date        Site  Species  Area  Count
10/1/1993   N654  PIRU     2     26
10/3/1994   N654  PIRU     2     29
10/1/1993   N654  BEPA     1     3

B
Date        Site  picrub  betpap
31Oct1993   1     13.5    1.6
14Nov1994   1     8.4     1.8

C
Date        Site  Species            Density
10/1/1993   N654  Picea rubens       13
10/3/1994   N654  Picea rubens       14.5
10/1/1993   N654  Betula papyrifera  3
10/31/1993  1     Picea rubens       13.5
10/31/1993  1     Betula papyrifera  1.6
11/14/1994  1     Picea rubens       8.4
11/14/1994  1     Betula papyrifera  1.8

[Bar chart: Density (#/m2), axis from 0 to 16, with one bar per Picea rubens and Betula papyrifera record]

Page 32: Data Integration, Analysis, and Synthesis

Scaling Analysis and Modeling

[Diagram. Labels:]
Data and Metadata Input (from Morpho/Metacat)
Execution engine (plugins): SAS, R, Matlab, Simulation models, ...
Analysis + Model Metadata: Inputs, Outputs, Processing
Output

Page 33: Data Integration, Analysis, and Synthesis

Scaling Analysis and Modeling

[Diagram: Execution Engine. Labels:]
Data and Metadata Input
Configuration for Analysis and Models
DDL Specification (inputs and DDL code)
Procedural Specification (inputs and proc code)
Input Map Specification (test inputs mapped to metadata/data fields)
Input Map Parser
Test Specification Parser
Script with unresolved variables
Script with symbolically resolved variables
Script with some fully resolved variables
Script/Metadata/Data Validation and Conflict Resolution
User or ontological input for conflict resolution
Data/Metadata Input Facilitator and Parser
Data Package (metadata with data file)
Fully resolved final script
Script Executor
Analytical Engine Plugin
Output Stream from Analytical Engine
Output Renderer
Output Config File
Output (HTML, XML, Text, etc.)
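The progression from a script with unresolved variables to a fully resolved final script can be pictured as staged template substitution. The sketch below is only an analogy built on Python's `string.Template`; it is not the execution engine's actual mechanism, and the R-style script text, the variable names, and the file name `plots.csv` are all hypothetical.

```python
from string import Template

# Hypothetical analysis script with three unresolved variables.
SCRIPT = "model <- lm($response ~ $predictor, data = read.csv('$datafile'))"


def resolve_symbols(script, input_map):
    """Stage 1: bind script variables to metadata attribute names.

    Variables missing from input_map are left in place for a later stage.
    """
    return Template(script).safe_substitute(input_map)


def resolve_final(script, data_bindings):
    """Stage 2: bind the remaining variables; raises KeyError if any is unbound."""
    return Template(script).substitute(data_bindings)


if __name__ == "__main__":
    partly = resolve_symbols(SCRIPT, {"response": "Density", "predictor": "Area"})
    print(partly)
    print(resolve_final(partly, {"datafile": "plots.csv"}))
```

Using the strict `substitute` call in the last stage means an unbound variable stops the run instead of reaching the analytical engine, which is a loose stand-in for the validation and conflict-resolution step named in the diagram.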

Page 34: Data Integration, Analysis, and Synthesis
Page 35: Data Integration, Analysis, and Synthesis

Semantic metadata

Describes the relationship between measurements and ecologically relevant concepts
Drawn from a controlled vocabulary
Ontology for ecological measurements

Page 36: Data Integration, Analysis, and Synthesis

Ecological Ontologies

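[Diagram: ontology fragment linking concepts by "isa" and "has" relations. Concepts: Biodiversity, Species, Taxon, Organism, Species Evenness (J'), Shannon Diversity (H'), Species Count (S), Abundance (N), Abundance of Species i (N_i), Sampling Area (A), Proportional Abundance of Species i (p_i).]

Shannon Diversity: $H' = -\sum_{i=1}^{S} p_i \ln p_i$

Species Evenness: $J' = \frac{H'}{\ln S}$

Proportional Abundance of Species i: $p_i = \frac{N_i}{N}$

Abundance: $N = \sum_{i=1}^{S} N_i$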
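The measures named in this ontology can be computed directly from per-species abundances using the standard definitions: p_i = N_i / N, Shannon diversity H' = -sum(p_i ln p_i), and evenness J' = H' / ln S. The sketch below is a minimal illustration; the function names and the sample counts are hypothetical, and species with zero abundance are simply ignored.

```python
import math


def shannon_diversity(abundances):
    """H' = -sum(p_i * ln p_i) over species with non-zero abundance."""
    counts = [n for n in abundances if n > 0]
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one species must have non-zero abundance")
    return -sum((n / total) * math.log(n / total) for n in counts)


def species_evenness(abundances):
    """J' = H' / ln S, where S is the number of species present."""
    s = sum(1 for n in abundances if n > 0)
    if s < 2:
        raise ValueError("evenness is undefined for fewer than two species")
    return shannon_diversity(abundances) / math.log(s)


if __name__ == "__main__":
    counts = [26, 3]  # hypothetical N_i for two species
    print(shannon_diversity(counts), species_evenness(counts))
```

When all species are equally abundant, H' equals ln S and J' equals 1, which gives a quick sanity check on any implementation.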

Page 37: Data Integration, Analysis, and Synthesis

What drives synthesis

Science questions
Hypotheses
Analyses + Models
Integrated Data
Original Data

Page 38: Data Integration, Analysis, and Synthesis

Conclusions

Barriers to integration can be addressed using structured metadata

Can accomplish a lot with ‘just’ mechanical transformations

Domain ontologies + semantic mediation are paths to scaling integration

Analysis drives all other phases of integration