Future Database Needs SC 32 Study Period February 5, 2007 Bruce Bargmeyer, Lawrence Berkeley National Laboratory University of California Tel: +1 510-495-2905 [email protected] JTC1 SC32N1633


Future Database Needs
SC 32 Study Period

February 5, 2007

Bruce Bargmeyer, Lawrence Berkeley National Laboratory
University of California
Tel: +1 510-495-2905 [email protected]

JTC1 SC32N1633


Topics

Study period purpose
New challenges
A brief tutorial on Semantics and semantic computing
Where XMDR fits
Semantic computing technologies
Traditional Data Administration
Some limitations of current relational technologies
Some input from other sources


Future Database Needs Study Period

A one-year study period to identify and understand case studies related to this area.

Bring together a small group of experts in a meeting on “Case Studies on new Database Standards Requirements”.

The workshop would provide input to existing SC32 projects and may provide background material for new proposals for upgrades or for new work within SC32 in time for the 2007 SC32 Plenary.

--Document 32N1451


The Internet Revolution

A world wide web of diverse content: The information glut is nothing new. The access to it is astonishing.


Challenge: Find and process non-explicit data

[Figure: drug concept hierarchy with node labels Analgesic Agent; Non-Narcotic Analgesic; Analgesic and Antipyretic; Nonsteroidal Antiinflammatory Drug; Acetaminophen; Datril; Anacin-3; Tylenol]

For example…

Patient data on drugs contains brand names (e.g. Tylenol, Anacin-3, Datril,…);

However, we want to study patients taking analgesic agents


Challenge: Specify and compute across Relations, e.g., within a food web in an Arctic ecosystem

An organism is connected to another organism for which it is a source of food energy and material by an arrow representing the direction of biomass transfer.

Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)


Challenge: Combine Data, Metadata & Concept Systems

Data:

ID  Date      Temp  Hg
A   06-09-13  4.4   4
B   06-09-13  9.3   2
X   06-09-13  6.7   78

Metadata:

Name  Datatype  Definition                     Units
ID    text      Monitoring Station Identifier  not applicable
Date  date      Date                           yy-mm-dd
Temp  number    Temperature (to 0.1 degree C)  degrees Celsius
Hg    number    Mercury contamination          micrograms per liter

Concept system:

Contamination
  Biological
  Chemical
    lead
    cadmium
    mercury
  Radioactive

Inference Search Query: “find water bodies downstream from Fletcher Creek where chemical contamination was over 10 micrograms per liter between December 2001 and March 2003”


Challenge: Use data from systems that record the same facts with different terms

[Figure: registries and repositories that hold common content: OASIS/ebXML Registries, ISO 11179 Registries, Ontological Registries, CASE Tool Repositories, UDDI Registries, Software Component Registries, Database Catalogs, Dublin Core Registries. The shared item Country Identifier is labeled across them as Data Element, XML Tag, Term Hierarchy, Attribute, Business Specification, Table Column, Business Object, and Coverage.]


Same Fact, Different Terms

Data Element Concept
Name: Country Identifiers
Unique ID: 5769
Other attributes shown: Context, Definition, Conceptual Domain, Maintenance Org., Steward, Classification, Registration Authority, Others
Conceptual domain values: Algeria, Belgium, China, Denmark, Egypt, France, . . ., Zimbabwe

Data Elements
Unique ID: 4572
Other attributes shown: Name, Context, Definition, Value Domain, Maintenance Org., Steward, Classification, Registration Authority, Others

ISO 3166      ISO 3166     ISO 3166      ISO 3166      ISO 3166
English Name  French Name  2-Alpha Code  3-Alpha Code  3-Numeric Code
Algeria       L'Algérie    DZ            DZA           012
Belgium       Belgique     BE            BEL           056
China         Chine        CN            CHN           156
Denmark       Danemark     DK            DNK           208
Egypt         Egypte       EG            EGY           818
France        La France    FR            FRA           250
. . .         . . .        . . .         . . .         . . .
Zimbabwe      Zimbabwe     ZW            ZWE           716


Challenge: Draw information together from a broad range of studies, databases, reports, etc.


Challenge: Gain Common Understanding of meaning between Data Creators and Data Users

[Figure: data creators and users (EEA, USGS, DoD, EPA, others) exchange text and data through information systems. Each source labels the same subject areas in its own terms, e.g., environ, agriculture, climate, human health, industry, tourism, soil, water, air in English and ambiente, agricultura, tiempo, salud humana, industria, turismo, tierra, agua, aire in Spanish, alongside columns of numeric data.]

A common interpretation of what the data represents


Challenge: Drawing Together Dispersed Data

[Figure: data creators and users (EEA, USGS, DoD, EPA, others) exchange text and data through information systems. Each source labels the same subject areas in its own terms, e.g., environ, agriculture, climate, human health, industry, tourism, soil, water, air in English and ambiente, agricultura, tiempo, salud humana, industria, turismo, tierra, agua, aire in Spanish, alongside columns of numeric data.]

A common interpretation of what the data represents


Semantic Computing

We are laying the foundation to make a quantum leap toward a substantially new way of computing: Semantic Computing

How can we make use of semantic computing?
What do organizations need to do to prepare for and stimulate semantic computing?


Coming: A Semantic Revolution

Searching and ranking
Pattern analysis
Knowledge discovery
Question answering
Reasoning
Semi-automated decision making


The Nub of It

Processing that takes “meaning” into account
Processing based on the relations between things, not just computing about the things themselves.
Computing that takes people out of the processing, reducing the human toil
  Data access, extraction, mapping, translation, formatting, validation, inferencing, …
Delivering higher-level results that are more helpful for the user’s thought and action


Semantics Challenges

Managing, harmonizing, and vetting semantics is essential to enable enterprise semantic computing

Managing, harmonizing, and vetting semantics is also important for traditional data management. In the past we just covered the basics.

Enabling “community intelligence” through efforts similar to Wikipedia, Wiktionary, Flickr


A Brief Tutorial on Semantics

What is meaning?
What are concepts?
What are relations?
What are concept systems?
What is “reasoning”?


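Meaning: The Semiotic Triangle

[Figure: semiotic triangle with vertices Thought or Reference (Concept), Symbol, and Referent. The symbol symbolises the thought, the thought refers to the referent, and the symbol stands for the referent. Example symbols: “Rose”, “ClipArt”.]

Source: C. K. Ogden and I. A. Richards. The Meaning of Meaning.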


Semiotic Triangle: Concepts, Definitions and Signs

[Figure: semiotic triangle with vertices CONCEPT (with its Definition), Sign (“Rose”, “ClipArt”), and Referent, and edges labeled Symbolizes, Refers To, and Stands For]


Definitions in the EPA Environmental Data Registry

Mailing Address:
http://www.epa.gov/edr/sw/AdministeredItem#MailingAddress
The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box

State USPS Code:
http://www.epa.gov/edr/sw/AdministeredItem#StateUSPSCode
The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada

Mailing Address State Name:
http://www.epa.gov/edr/sw/AdministeredItem#StateName
The name of the state where mail is delivered


SNOMED – Terms Defined by Relations


Computable Meaning

[Figure: semiotic triangle with vertices CONCEPT, Sign (“Rose”, “ClipArt”), and Referent, and edges labeled Symbolizes, Refers To, and Stands For. Concepts are related to one another by rdfs:subClassOf, owl:equivalentClass, and owl:disjointWith.]

If “rose” is owl:disjointWith “daffodil”, then a computer can determine that an assertion is invalid if it states that a rose is also a daffodil (e.g., in a knowledge base).
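The disjointness check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DISJOINT` set, the individual names (`flower1`, `flower2`), and the function `find_violations` are hypothetical, and a real system would more likely rely on an OWL reasoner than on hand-written code.

```python
# Hypothetical class-level axioms: each frozenset is a pair of classes
# declared disjoint (the role played by owl:disjointWith).
DISJOINT = {frozenset({"rose", "daffodil"})}


def find_violations(assertions, disjoint=DISJOINT):
    """Return (individual, class_a, class_b) for every individual that is
    asserted to belong to two classes declared disjoint."""
    types = {}
    for individual, cls in assertions:
        types.setdefault(individual, set()).add(cls)

    violations = []
    for individual, classes in types.items():
        ordered = sorted(classes)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                if frozenset({a, b}) in disjoint:
                    violations.append((individual, a, b))
    return violations


# Hypothetical knowledge-base assertions: flower1 is typed as both classes.
assertions = [
    ("flower1", "rose"),
    ("flower1", "daffodil"),
    ("flower2", "rose"),
]
print(find_violations(assertions))
```

Only `flower1` is reported, because it is the only individual typed with both disjoint classes.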


What are Relations?

[Figure: Merced Lake, Merced River, and Fletcher Creek are each linked to the concept WaterBody by an isA relation]

Concepts and relations can be represented as nodes and edges in formal graph structures, e.g., “is-a” hierarchies.


Concept Systems have Nodes and may have Relations

[Figure: example graph with nodes labeled A, a, b, c, d, 1, and 2. Nodes represent concepts; lines (arcs) represent relations.]

Concept systems are concepts and the relations between them.
Concept systems can be represented and queried as graphs.


A More Complex Concept Graph

Concept lattice of inland water features

[Figure: concept lattice. Attribute labels: Linear, Non-linear, Large, Large linear, Small linear, Small non-linear, Deep, Shallow, Natural, Artificial, Flowing, Stagnant. Feature labels: River, Stream, Canal, Reservoir, Lake, Marsh, Pond.]

From: Paulo Santos and Brandon Bennett, Supervaluation Semantics for an Inland Water Feature Ontology. http://ijcai.org/papers/1187.pdf#search=%22terminology%20water%20ontology%22


Types of Concept System Graph Structures

[Figure: example drawings of each structure: Tree; Partial Order Tree; Ordered Tree; Faceted Classification; Directed Acyclic Graph; Partial Order Graph; Powerset of 3 element set; Bipartite Graph; Clique; Compound Graph]


Graph Taxonomy

[Figure: taxonomy of graph types with node labels Graph; Directed Graph; Undirected Graph; Directed Acyclic Graph; Bipartite Graph; Partial Order Graph; Faceted Classification; Clique; Partial Order Tree; Tree; Lattice; Ordered Tree]

Note: not all bipartite graphs are undirected.


What Kind of Relations are There? Lots!

Relationship class: A particular type of connection existing between people related to or having dealings with each other.

acquaintanceOf - A person having more than slight or superficial knowledge of this person but short of friendship.
ambivalentOf - A person towards whom this person has mixed feelings or emotions.
ancestorOf - A person who is a descendant of this person.
antagonistOf - A person who opposes and contends against this person.
apprenticeTo - A person to whom this person serves as a trusted counselor or teacher.
childOf - A person who was given birth to or nurtured and raised by this person.
closeFriendOf - A person who shares a close mutual friendship with this person.
collaboratesWith - A person who works towards a common goal with this person.


Example of relations in a food web in an Arctic ecosystem

An organism is connected to another organism for which it is a source of food energy and material by an arrow representing the direction of biomass transfer.

Source: http://en.wikipedia.org/wiki/Food_web#Food_web (from SPIRE)


Ontologies are a type of Concept System

Ontology: explicit formal specifications of the terms in the domain and relations among them (Gruber 1993)

An ontology defines a common vocabulary for researchers who need to share information in a domain. It includes machine-interpretable definitions of basic concepts in the domain and relations among them.

Why would someone want to develop an ontology? Some of the reasons are:
To share common understanding of the structure of information among people or software agents
To enable reuse of domain knowledge
To make domain assumptions explicit
To separate domain knowledge from the operational knowledge
To analyze domain knowledge

http://www.ksl.stanford.edu/people/dlm/papers/ontology101/ontology101-noy-mcguinness.html


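What is Reasoning? Inference

[Figure: is-a hierarchy. Polio and Smallpox are each an Infectious Disease; Diabetes and Heart disease are each a Chronic Disease; Infectious Disease and Chronic Disease are each a Disease. A separate arrow style signifies an inferred is-a relationship.]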
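Inferred is-a relationships follow from treating is-a as transitive. The sketch below is a minimal illustration, not a description of any particular reasoner: the `IS_A` dictionary encodes the hierarchy named on the slide, and the function names `ancestors` and `inferred_is_a` are hypothetical.

```python
# Asserted is-a edges, child -> list of direct parents.
IS_A = {
    "Polio": ["Infectious Disease"],
    "Smallpox": ["Infectious Disease"],
    "Diabetes": ["Chronic Disease"],
    "Heart disease": ["Chronic Disease"],
    "Infectious Disease": ["Disease"],
    "Chronic Disease": ["Disease"],
}


def ancestors(concept, is_a=IS_A):
    """All concepts reachable by following is-a edges upward."""
    seen = set()
    stack = list(is_a.get(concept, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def inferred_is_a(is_a=IS_A):
    """Pairs (child, ancestor) implied by transitivity but not asserted."""
    return sorted(
        (child, anc)
        for child in is_a
        for anc in ancestors(child, is_a)
        if anc not in is_a[child]
    )


for child, anc in inferred_is_a():
    print(f"{child} is-a {anc} (inferred)")
```

Each leaf concept gains one inferred edge to Disease that was never stored explicitly.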


Reasoning: Taxonomies & partonomies can be used to support inference queries

[Figure: part-of hierarchy. Oakland and Berkeley are each part-of Alameda County; Santa Clara and San Jose are each part-of Santa Clara County; Alameda County and Santa Clara County are each part-of California.]

E.g., if a database contains information on events by city, we could query that database for events that happened in a particular county or state, even though the event data does not contain explicit state or county codes.
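A partonomy-backed query of this kind might look like the following sketch. The `PART_OF` map uses the places named on the slide; the `EVENTS` records and the helper names `is_within` and `events_in` are hypothetical and stand in for whatever event table and query layer a real system would have.

```python
# part-of edges from the slide, place -> containing region.
PART_OF = {
    "Oakland": "Alameda County",
    "Berkeley": "Alameda County",
    "Santa Clara": "Santa Clara County",
    "San Jose": "Santa Clara County",
    "Alameda County": "California",
    "Santa Clara County": "California",
}

# Hypothetical event records: only the city is stored explicitly.
EVENTS = [
    {"event": "E1", "city": "Oakland"},
    {"event": "E2", "city": "San Jose"},
    {"event": "E3", "city": "Berkeley"},
]


def is_within(place, region, part_of=PART_OF):
    """True if `place` equals `region` or lies inside it via part-of."""
    while place is not None:
        if place == region:
            return True
        place = part_of.get(place)
    return False


def events_in(region, events=EVENTS):
    """Events whose city falls inside `region`, found through the partonomy."""
    return [e["event"] for e in events if is_within(e["city"], region)]


print(events_in("Alameda County"))
print(events_in("California"))
```

The event table never mentions a county or state; the region filter works only because the partonomy supplies the missing containment facts.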


Reasoning: Relationship metadata can be used to infer non-explicit data

[Figure: drug concept hierarchy with node labels Analgesic Agent; Non-Narcotic Analgesic; Analgesic and Antipyretic; Nonsteroidal Antiinflammatory Drug; Acetaminophen; Datril; Anacin-3; Tylenol]

For example…
(1) patient data on drugs currently being taken contains brand names (e.g. Tylenol, Anacin-3, Datril,…);
(2) concept system connects different drug types and names with one another (via is-a, part-of, etc. relationships);
(3) so… patient data can be linked and searched by inferred terms like “acetaminophen” and “analgesic” as well as trade names explicitly stored as text strings in the database
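One way to picture steps (1) to (3) is query expansion: a search term is expanded to all of its narrower terms before matching against the stored strings. The sketch below assumes a particular parent/child arrangement of the labels in the figure, and the patient records, the placeholder drug `BrandX`, and the function names are hypothetical.

```python
from collections import defaultdict

# Assumed is-a edges (child -> parent) among the labels shown in the figure.
IS_A = {
    "Tylenol": "Acetaminophen",
    "Anacin-3": "Acetaminophen",
    "Datril": "Acetaminophen",
    "Acetaminophen": "Analgesic and Antipyretic",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Non-Narcotic Analgesic": "Analgesic Agent",
}

# Hypothetical patient records that store brand names only.
PATIENT_DRUGS = [("p1", "Tylenol"), ("p2", "Datril"), ("p3", "BrandX")]


def descendants(concept, is_a=IS_A):
    """All narrower terms below `concept` in the is-a hierarchy."""
    children = defaultdict(list)
    for child, parent in is_a.items():
        children[parent].append(child)
    found, stack = set(), [concept]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def patients_taking(concept, records=PATIENT_DRUGS):
    """Patients whose recorded drug is `concept` or any narrower term."""
    terms = descendants(concept) | {concept}
    return sorted({patient for patient, drug in records if drug in terms})


print(patients_taking("Analgesic Agent"))
```

The string "Analgesic Agent" never appears in the patient data, yet the query still finds the patients recorded under brand names.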


Reasoning: Least Common Ancestor Query

[Figure: concept hierarchy with node labels Analgesic Agent; Non-Narcotic Analgesic; Analgesic and Antipyretic; Acetaminophen; Nonsteroidal Antiinflammatory Drug; Opioid; Opiate; Morphine Sulfate; Codeine Phosphate]

What is the least common ancestor concept in the NCI Thesaurus for Acetaminophen and Morphine Sulfate? (answer = Analgesic Agent)
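A least common ancestor query can be sketched as follows. The `PARENT` map is an assumed arrangement of the labels in the figure, chosen to be consistent with the stated answer, and it simplifies the thesaurus to one parent per concept; a real thesaurus may give a concept several parents, which would call for a DAG-aware algorithm.

```python
# Assumed single-parent hierarchy (child -> parent) over the figure's labels.
PARENT = {
    "Acetaminophen": "Analgesic and Antipyretic",
    "Analgesic and Antipyretic": "Non-Narcotic Analgesic",
    "Nonsteroidal Antiinflammatory Drug": "Non-Narcotic Analgesic",
    "Non-Narcotic Analgesic": "Analgesic Agent",
    "Morphine Sulfate": "Opiate",
    "Codeine Phosphate": "Opiate",
    "Opiate": "Opioid",
    "Opioid": "Analgesic Agent",
}


def path_to_root(concept, parent=PARENT):
    """The concept followed by its ancestors, nearest first."""
    path = [concept]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def least_common_ancestor(a, b, parent=PARENT):
    """Nearest concept that lies on both upward paths, or None."""
    ancestors_of_a = set(path_to_root(a, parent))
    for node in path_to_root(b, parent):
        if node in ancestors_of_a:
            return node
    return None


print(least_common_ancestor("Acetaminophen", "Morphine Sulfate"))
```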


Reasoning: Example “sibling” queries: concepts that share a common ancestor

Environmental
"siblings" of Wetland (in NASA SWEET ontology)

Health
Siblings of ERK1 finds all 700+ other kinase enzymes
Siblings of Novastatin finds all other statins

11179 Metadata
Sibling values in an enumerated value domain


Reasoning: More complex “sibling” queries: concepts with multiple ancestors

Health
Find all the siblings of Breast Neoplasm

[Figure: Breast neoplasm sits under two parents, site neoplasms and breast disorders. Other nodes shown: Respiratory System neoplasm, Eye neoplasm, Non-Neoplastic Breast Disorder.]

Environmental
Find all chemicals that are a carcinogen (cause cancer) and a toxin (are poisonous) and teratogenic (cause birth defects)
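When a concept has more than one parent, a sibling query has to collect the children of every parent. The sketch below assumes which parent each node in the figure belongs to (the assignment in `PARENTS` is a guess from the labels), and the function name `siblings` is hypothetical.

```python
# Assumed parent lists (child -> parents) for the nodes in the figure.
PARENTS = {
    "Breast neoplasm": ["site neoplasms", "breast disorders"],
    "Respiratory System neoplasm": ["site neoplasms"],
    "Eye neoplasm": ["site neoplasms"],
    "Non-Neoplastic Breast Disorder": ["breast disorders"],
}


def siblings(concept, parents=PARENTS):
    """Concepts that share at least one direct parent with `concept`."""
    mine = set(parents.get(concept, []))
    return sorted(
        other
        for other, theirs in parents.items()
        if other != concept and mine & set(theirs)
    )


print(siblings("Breast neoplasm"))
print(siblings("Eye neoplasm"))
```

Breast neoplasm picks up siblings from both branches, while Eye neoplasm only sees the concepts under site neoplasms.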


End of Tutorial about concept systems

What are the “Database Language” challenges?


Metadata Registries & Database Technologies – Which Does What?

Traditional Data Registries (11179 Edition 2)
Register metadata which describes data in databases, applications, XML Schemas, data models, flat files, paper
Assist in harmonizing, standardizing, and vetting metadata
Assist data engineering
Provide a source of well formed data designs for system designers
Record reporting requirements
Assist data generation, by describing the meaning of data entry fields and the potential valid values
Register provenance information that can be provided to end users of data
Assist with information discovery by pointing to systems where particular data is maintained.


Traditional MDR: Manage Code Sets

[Figure: the Country Identifiers data element concept (Unique ID 5769) with its data elements (Unique ID 4572) and their ISO 3166 value domains: English name, French name, 2-alpha code, 3-alpha code, and 3-numeric code, listing Algeria, Belgium, China, Denmark, Egypt, France, . . ., Zimbabwe]


What Can XMDR Do?

Support a new generation of semantic computing
Concept system management
Harmonizing and vetting concept systems
Linkage of concept systems to data
Interrelation of multiple concept systems
Grounding ontologies and RDF in agreed upon semantics
Reasoning across XMDR content (concept systems and metadata)
Provision of Semantic Services


We are trying to manage semantics in an increasingly complex content space

Structured data
Semi-structured data
Unstructured data
Text
Pictographic
Graphics
Multimedia
Voice
Video


Case Study

Combining Concept Systems, Data, and Metadata to answer queries.


Linking Concepts: Text Document

§ 141.62 Maximum contaminant levels for inorganic contaminants.
(a) [Reserved]
(b) The maximum contaminant levels for inorganic contaminants specified in paragraphs (b)(2)–(6), (b)(10), and (b)(11)–(16) of this section apply to community water systems and non-transient, non-community water systems. The maximum contaminant level specified in paragraph (b)(1) of this section only applies to community water systems. The maximum contaminant levels specified in (b)(7), (b)(8), and (b)(9) of this section apply to community water systems; non-transient, non-community water systems; and transient non-community water systems.

Contaminant    MCL (mg/l)
(1) Fluoride   4.0
(2) Asbestos   7 Million Fibers/liter (longer than 10 μm).
(3) Barium     2
(4) Cadmium    0.005
(5) Chromium   0.1
(6) Mercury    0.002
(7) Nitrate    10 (as Nitrogen)

§ 141.62 40 CFR Ch. I (7–1–02 Edition)

Title 40--Protection of Environment

CHAPTER I--ENVIRONMENTAL PROTECTION AGENCY
PART 141--NATIONAL PRIMARY DRINKING WATER REGULATIONS


Thesaurus Concept System (From GEMET)

Chemical Contamination

Definition: The addition or presence of chemicals to, or in, another substance to such a degree as to render it unfit for its intended purpose.

Broader Term: contamination

Narrower Terms: cadmium contamination, lead contamination, mercury contamination

Related Terms: chemical pollutant, chemical pollution

Deutsch: Chemische Verunreinigung

English (US): chemical contamination

Español: contaminación química

Source: General Multi-Lingual Environmental Thesaurus (GEMET)


Concept System (Thesaurus)

Contamination
  Biological
  Chemical (related terms: chemical pollutant, chemical pollution)
    cadmium
    lead
    mercury
  Radioactive


Chemicals in EPA Environmental Data Registry

Name                                                                      Type                 CAS Number
Acalypha ostryifolia                                                      Biological Organism
Mercury                                                                   Chemical             7439-97-6
Mercury, bis(acetato-.kappa.O)(benzenamine)-                              Chemical             63549-47-3
Mercury, (acetato-.kappa.O)phenyl-, mixt. with phenylmercuric propionate  Chemical             No CAS Number

TSN: 28189
ICTV:
EPA ID: E17113275, E965269


Data

Monitoring Stations

Name  Latitude  Longitude  Location
A     41.45 N   125.99 W   Merced Lake
B     43.23 N   120.50 W   Merced River
X     39.45 N   118.12 W   Fletcher Creek

Measurements

ID  Date        Temp  Hg
A   2006-09-13  4.4   4
B   2006-09-13  9.3   2
X   2006-09-15  5.2   3
X   2006-09-13  6.7   78

[Figure: map locating stations A, B, and X on Merced Lake, Merced River, and Fletcher Creek]


Metadata

System               Data Element  Definition                        Units                 Precision
Measurements         ID            Monitoring Station Identifier     not applicable        not applicable
Measurements         Date          Date sample was collected         not applicable        not applicable
Measurements         Temp          Temperature                       degrees Celsius       0.1
Measurements         Hg            Mercury contamination             micrograms per liter  0.004
Monitoring Stations  Name          Monitoring Station Identifier
Monitoring Stations  Latitude      Latitude where sample was taken
Monitoring Stations  Longitude     Longitude where sample was taken
Monitoring Stations  Location      Body of water monitored
Contaminants         Contaminant   Name of contaminant
Contaminants         Threshold     Acceptable threshold value

Contaminants

Contaminant  Threshold
mercury      5
lead         42?
cadmium      250?


Relations among Inland Bodies of Water

[Figure: two graphs over the nodes Fletcher Creek, Merced River, and Merced Lake. The first connects them with "feeds into" edges only; the second connects the same nodes with a "fed from" edge and a "feeds into" edge.]


Combining Data, Metadata & Concept Systems

Data

ID  Date      Temp  Hg
A   06-09-13  4.4   4
B   06-09-13  9.3   2
X   06-09-13  6.7   78

Metadata

Name  Datatype  Definition                     Units
ID    text      Monitoring Station Identifier  not applicable
Date  date      Date                           yy-mm-dd
Temp  number    Temperature (to 0.1 degree C)  degrees Celsius
Hg    number    Mercury contamination          micrograms per liter

Concept system

Contamination
  Biological
  Chemical
    lead
    cadmium
    mercury
  Radioactive

Inference Search Query: “find water bodies downstream from Fletcher Creek where chemical contamination was over 2 parts per billion between December 2001 and March 2003”
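The inference search query can be sketched by combining the three kinds of information in one small program. This is an illustration under several assumptions, not the method used by XMDR: the direction of the `FEEDS_INTO` edges is a guess at the water-body relations, the threshold is taken to be in the Hg column's registered unit (micrograms per liter) so no unit conversion is shown, and all function and variable names are hypothetical. The rows come from the Monitoring Stations and Measurements tables.

```python
from datetime import date

# Data: station locations and measurements (Hg in micrograms per liter).
STATIONS = {"A": "Merced Lake", "B": "Merced River", "X": "Fletcher Creek"}
MEASUREMENTS = [
    ("A", date(2006, 9, 13), {"Hg": 4}),
    ("B", date(2006, 9, 13), {"Hg": 2}),
    ("X", date(2006, 9, 15), {"Hg": 3}),
    ("X", date(2006, 9, 13), {"Hg": 78}),
]

# Metadata: which concept each measurement column records.
COLUMN_CONCEPT = {"Hg": "mercury contamination"}

# Concept system: narrower term -> broader term.
BROADER = {
    "mercury contamination": "chemical contamination",
    "lead contamination": "chemical contamination",
    "cadmium contamination": "chemical contamination",
    "chemical contamination": "contamination",
}

# Relations among water bodies. ASSUMPTION: edge direction is illustrative.
FEEDS_INTO = {"Fletcher Creek": "Merced River", "Merced River": "Merced Lake"}


def downstream(body, feeds_into=FEEDS_INTO):
    """Water bodies reached by following feeds-into edges from `body`."""
    result = []
    while body in feeds_into:
        body = feeds_into[body]
        result.append(body)
    return result


def is_kind_of(concept, ancestor, broader=BROADER):
    """True if `concept` equals `ancestor` or is narrower than it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = broader.get(concept)
    return False


def contaminated_downstream(source, kind, threshold, start, end):
    """Downstream water bodies with a reading of `kind` above `threshold`."""
    targets = set(downstream(source))
    columns = [c for c, k in COLUMN_CONCEPT.items() if is_kind_of(k, kind)]
    hits = set()
    for station, day, values in MEASUREMENTS:
        if STATIONS[station] in targets and start <= day <= end:
            if any(values.get(c, 0) > threshold for c in columns):
                hits.add(STATIONS[station])
    return sorted(hits)


print(contaminated_downstream(
    "Fletcher Creek", "chemical contamination", 2,
    date(2006, 9, 1), date(2006, 9, 30)))
```

With the sample rows, the date window from the slide's query (December 2001 to March 2003) returns nothing because all readings are from September 2006, so the usage line above uses a September 2006 window instead.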


Example – Environmental Text Corpus

Idea: Develop an environmental research corpus that could attract R&D efforts. Include the reports and other material from over $1b EPA sponsored research.
Prepare the corpus and make it available
Research results from years of ORD R&D
Publish associated metadata and concept systems in XMDR
Use open source software for EPA testing


Information Extraction & Semantic Computing

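[Figure: information extraction pipeline. Extraction Engine steps: Segment, Classify, Associate, Normalize, Deduplicate. Subsequent steps: Discover patterns, Select models, Fit parameters, Inference, Report results. Outputs: Actionable Information, Decision Support. The pipeline is linked to 11179-3 (E3) and XMDR.]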


Metadata Registries are Useful

Registered semantics
  For “training” extraction engines
  The “Normalize” function can make use of standard code sets that have mapping between representation forms.
  The “Classify” function can interact with pre-established concept systems.
Provenance
  High precision for proper nouns, less precision (e.g., 70%) for other concepts -> impacts downstream processing. Need to track precision.
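A “Normalize” step backed by a registered code set might be sketched as below. The rows are the ISO 3166 values shown on the code-set slides; the choice of the 3-alpha code as the canonical form, and the names `COUNTRY_CODES`, `LOOKUP`, and `normalize`, are assumptions made for illustration.

```python
# Registered code set: each row lists the representation forms of one country
# (English name, French name, 2-alpha, 3-alpha, 3-numeric).
COUNTRY_CODES = [
    ("Algeria", "L'Algérie", "DZ", "DZA", "012"),
    ("Belgium", "Belgique", "BE", "BEL", "056"),
    ("China", "Chine", "CN", "CHN", "156"),
    ("Denmark", "Danemark", "DK", "DNK", "208"),
    ("Egypt", "Egypte", "EG", "EGY", "818"),
    ("France", "La France", "FR", "FRA", "250"),
    ("Zimbabwe", "Zimbabwe", "ZW", "ZWE", "716"),
]

# Map every known form (case-insensitively) to the 3-alpha code.
LOOKUP = {}
for row in COUNTRY_CODES:
    for form in row:
        LOOKUP[form.casefold()] = row[3]


def normalize(mention):
    """Return the 3-alpha code for an extracted mention, or None if unknown."""
    return LOOKUP.get(mention.strip().casefold())


for mention in ["Belgique", " dz ", "818", "Atlantis"]:
    print(repr(mention), "->", normalize(mention))
```

Mentions that are not in the registered code set return `None`, which gives downstream steps a place to record reduced precision.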


Normalize – Need Registered and Mapped Concepts/Code Sets

[Figure: the Country Identifiers data element concept (Unique ID 5769) with its data elements (Unique ID 4572) and their ISO 3166 value domains: English name, French name, 2-alpha code, 3-alpha code, and 3-numeric code, listing Algeria, Belgium, China, Denmark, Egypt, France, . . ., Zimbabwe]


Challenge for Database Languages

The extraction database can contain graphs with > a billion nodes.
Types of queries that can be done
Query performance
Linkage of “extract database” concepts and relations to same concepts and relations in traditional databases.


Example – 11179-3 (E3) Support Semantic Web Applications

The address state code is “AB”. This can be expressed as a directed graph, e.g., an RDF statement:

Graph:  Node (Address)     --Edge (State Code)-->       Node (AB)
RDF:    Subject (Address)  --Predicate (State Code)-->  Object (AB)

XMDR may be used to “ground” the semantics of an RDF statement.


Example: Grounding RDF nodes and relations: URIs Reference a Metadata Registry

@prefix dbA: “http://www.epa.gov/databaseA”
@prefix ai: “http://www.epa.gov/edr/sw/AdministeredItem#”

dbA:e0139  --ai:MailingAddress-->  dbA:ma344  --ai:StateUSPSCode-->  “AB”^^ai:StateCode


Definitions in the EPA Environmental Data Registry

Mailing Address:
http://www.epa.gov/edr/sw/AdministeredItem#MailingAddress
The exact address where a mail piece is intended to be delivered, including urban-style address, rural route, and PO Box

State USPS Code:
http://www.epa.gov/edr/sw/AdministeredItem#StateUSPSCode
The U.S. Postal Service (USPS) abbreviation that represents a state or state equivalent for the U.S. or Canada

Mailing Address State Name:
http://www.epa.gov/edr/sw/AdministeredItem#StateName
The name of the state where mail is delivered
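
The grounding idea on these two slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary REGISTRY is a hypothetical in-memory stand-in for a metadata registry (the slides do not describe an API), the functions expand and ground are invented for this sketch, and the two triples reflect one plausible reading of the diagram. Only the "ai" prefix and the three definitions come from the slides.

```python
# The "ai" namespace as given on the slide.
AI = "http://www.epa.gov/edr/sw/AdministeredItem#"
PREFIXES = {"ai": AI}

# Hypothetical stand-in for a metadata registry: URI -> registered definition.
REGISTRY = {
    AI + "MailingAddress": (
        "The exact address where a mail piece is intended to be delivered, "
        "including urban-style address, rural route, and PO Box"
    ),
    AI + "StateUSPSCode": (
        "The U.S. Postal Service (USPS) abbreviation that represents a state "
        "or state equivalent for the U.S. or Canada"
    ),
    AI + "StateName": "The name of the state where mail is delivered",
}

# Assumed reading of the slide's graph: (subject, predicate, object).
TRIPLES = [
    ("dbA:e0139", "ai:MailingAddress", "dbA:ma344"),
    ("dbA:ma344", "ai:StateUSPSCode", '"AB"'),
]


def expand(qname, prefixes):
    """Expand a prefixed name such as 'ai:StateUSPSCode' into a full URI."""
    prefix, sep, local = qname.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {qname!r}")
    return prefixes[prefix] + local


def ground(triple, prefixes, registry):
    """Return (predicate URI, registered definition or None) for a triple."""
    _subject, predicate, _obj = triple
    uri = expand(predicate, prefixes)
    return uri, registry.get(uri)


for triple in TRIPLES:
    uri, definition = ground(triple, PREFIXES, REGISTRY)
    print(uri, "->", definition)
```

Because the predicate is a URI that resolves to a registered definition, two databases that both use ai:StateUSPSCode can be taken to mean the same thing by it. A predicate with no registry entry comes back with None, which marks it as ungrounded.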


Ontologies for Data Mapping

[Figure: concept graph for country identifiers. Concept nodes: Geographic Area, Geographic Sub-Area, Country, Country Identifier, Country Name (Short Name, Long Name), Country Code (ISO 3166 2-Character Code, ISO 3166 3-Character Code, ISO 3166 3-Numeric Code, FIPS Code). Data elements shown: Distributor Country Name, Mailing Address Country Name.]

Ontologies can help to capture and express semantics


Example: Content Mapping Service

Collect data from many sources – files contain data that has the same facts represented by different terms. E.g., one system responds with Danemark, DK, another with DNK, another with 208; map all to Denmark.

XMDR could accept XML files with the data from different code sets and return a result mapped to a single code set.
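
A minimal sketch of such a mapping, assuming the registered code sets are available as simple rows (the slide describes XML input and an XMDR service; neither is modeled here). The names COUNTRIES, build_index, and normalize are hypothetical, and the rows are the ISO 3166 values shown on the earlier data elements slide.

```python
# Illustrative subset of registered, mapped code sets:
# (English name, French name, 2-alpha, 3-alpha, 3-numeric)
COUNTRIES = [
    ("Algeria", "L'Algérie", "DZ", "DZA", "012"),
    ("Belgium", "Belgique", "BE", "BEL", "056"),
    ("China", "Chine", "CN", "CHN", "156"),
    ("Denmark", "Danemark", "DK", "DNK", "208"),
    ("Egypt", "Egypte", "EG", "EGY", "818"),
    ("France", "La France", "FR", "FRA", "250"),
    ("Zimbabwe", "Zimbabwe", "ZW", "ZWE", "716"),
]


def build_index(rows):
    """Map every representation (case-insensitively) to the English name."""
    index = {}
    for row in rows:
        target = row[0]
        for value in row:
            index[value.casefold()] = target
    return index


def normalize(value, index):
    """Return the English name for any known representation, else None."""
    return index.get(str(value).strip().casefold())


INDEX = build_index(COUNTRIES)
print([normalize(v, INDEX) for v in ("Danemark", "DK", "DNK", "208")])
# prints ['Denmark', 'Denmark', 'Denmark', 'Denmark']
```

The lookup only works because the code sets are registered together with their correspondences. Without the mapping rows, "DK" and "208" are unrelated strings.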


Actions to Manage Enterprise Semantics

Define data, concepts, and relations
Harmonize and vet data and concept systems
Ground semantics for RDF, concept systems, ontologies
Provide semantics services


Challenge: Concept System Store

[Figure: Metadata Registry diagram. Labels: Concept System, Thesaurus, Themes, Data Standards, Ontology, GEMET, Structured Metadata, Users.]

Concept systems:
  Keywords
  Controlled Vocabularies
  Thesauri
  Taxonomies
  Ontologies
  Axiomatized Ontologies
(Essentially graphs: node-relation-node + axioms)


Challenge: Management of Concept Systems

[Figure: Metadata Registry diagram. Labels: Concept System, Thesaurus, Themes, Data Standards, Ontology, GEMET, Structured Metadata, Users.]

Concept system:
  Registration
  Harmonization
  Standardization
  Acceptance (vetting)
  Mapping (correspondences)


Challenge: Life Cycle Management

[Figure: Metadata Registry diagram. Labels: Concept System, Thesaurus, Themes, Data Standards, Ontology, GEMET, Structured Metadata, Users.]

Life cycle management:
  Data and
  Concept systems (ontologies)


Challenge: Grounding Semantics

[Figure: Metadata Registry diagram. Labels: Concept System, Thesaurus, Themes, Data Standards, Ontology, GEMET, Structured Metadata, Users.]

Metadata Registries
Semantic Web
RDF Triples: Subject (node URI), Verb (relation URI), Object (node URI)
Ontologies


Some Limitations of Relational Technologies & SQL

Limited graph computations
  Weak graph query language
Limited object computations
  Weak object query language
Inadequate linkage of metadata to data (underspecified “catalog”)
  CASE tools also disable, rather than enable, data administration & semantics management


Limitations (Cont.)

Limited linkage of concept system (graphs) to data (relational, graph, object)


Some Input From WG 2 and XMDR

Look at recent work on a graph query language by David Silberberg of Johns Hopkins University Applied Physics Lab.


Input from WG 2 and XMDR

David Jensen, of the University of Massachusetts Amherst ( http://kdl.cs.umass.edu/people/jensen/ ), has been developing a very interesting Proximity system and in the process has worked with complex patterns in very large data sets, including alternative query languages and database technologies ( http://kdl.cs.umass.edu/proximity/index.html ).

QGRAPH is a new visual language for querying and updating graph databases. A key feature of QGRAPH is that the user can draw a query consisting of vertices and edges with specified relations between their attributes. The response will be the collection of all subgraphs of the database that have the desired pattern.
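
The kind of query described here (return every subgraph that matches a pattern of vertices and edges) can be illustrated with a small brute-force matcher. This is not QGRAPH or Proximity code: the function match_pattern, the data layout, and the example graph are all assumptions made for this sketch, and the exhaustive search would not scale to the data sizes discussed in this deck.

```python
from itertools import permutations


def match_pattern(nodes, edges, pattern_vertices, pattern_edges):
    """Find all bindings of pattern variables to distinct graph nodes.

    nodes:            {node_id: attribute dict}
    edges:            set of (source_id, target_id, label)
    pattern_vertices: {variable: predicate(attribute dict) -> bool}
    pattern_edges:    list of (source_variable, target_variable, label)
    """
    variables = list(pattern_vertices)
    matches = []
    # Try every assignment of distinct nodes to the pattern variables.
    for combo in permutations(nodes, len(variables)):
        binding = dict(zip(variables, combo))
        if not all(pattern_vertices[v](nodes[binding[v]]) for v in variables):
            continue
        if all((binding[s], binding[t], label) in edges
               for s, t, label in pattern_edges):
            matches.append(binding)
    return matches


# Hypothetical example graph: people and the papers they authored.
NODES = {
    "p1": {"type": "person"},
    "p2": {"type": "person"},
    "p3": {"type": "person"},
    "d1": {"type": "paper"},
    "d2": {"type": "paper"},
}
EDGES = {
    ("p1", "d1", "authored"),
    ("p2", "d1", "authored"),
    ("p3", "d2", "authored"),
}

# Pattern: two distinct people who authored the same paper.
PATTERN_VERTICES = {
    "a": lambda attrs: attrs["type"] == "person",
    "b": lambda attrs: attrs["type"] == "person",
    "doc": lambda attrs: attrs["type"] == "paper",
}
PATTERN_EDGES = [("a", "doc", "authored"), ("b", "doc", "authored")]

MATCHES = match_pattern(NODES, EDGES, PATTERN_VERTICES, PATTERN_EDGES)
for m in MATCHES:
    print(m)
```

Each co-author pair is reported twice, once per ordering of the variables a and b. A real graph query language would need a way to state or suppress such symmetries, and an optimizer to avoid the exhaustive enumeration used here.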


Input from WG 2 and XMDR

Query languages are necessary to extract useful information from massive data sets. Moreover, annotated corpora require thousands of hours of manual annotation to create, revise and maintain. Query languages are also useful during this process. For example, queries can be used to find parse errors or to transform annotations into different schemes. However, they suffer from several problems. First, updates are not supported as query languages focus on the needs of linguists searching for syntactic constructions. Second, their relationship to existing database query languages is poorly understood, making it difficult to apply standard database indexing and query optimization techniques. As a consequence they do not scale well.

Finally, linguistic annotations have both a sequential and a hierarchical organization. Query languages must support queries that refer to both of these types of structure simultaneously. Such hybrid queries should have a concise syntax. The interplay between these factors has resulted in a variety of mutually-inconsistent approaches.

Catherine Lai and Steven Bird

Department of Computer Science and Software Engineering

University of Melbourne, Victoria 3010, Australia


Input from WG 2 and XMDR

Try to keep an eye on companies that are grappling with advanced database, knowledge management, information extraction, and analysis requirements, such as Metamatrix, I2, NetViz, Top Quadrant, OntologyWorks, Franz, Cogito, or Objectivity, with new ones cropping up very often.

Check out the EU sites given the large investments being made there in areas of interest. For example, KAON.

Watch the outcome of an NSF funded project on querying linguistic databases, including annotated corpora ( http://projects.ldc.upenn.edu/QLDB/ ). Steven Bird at U. Melbourne is one of the principals on that project.


Input from WG 2 and XMDR

Need for graph query languages that go beyond RDF and XML

Frank Olken: Make SQL a strongly typed language with respect to measurement dimensionality.

Performance: project graph structured queries against graph structured data. Express with great difficulty the query in SQL.

Complex objects. Model gets complex. Putting Humpty Dumpty together again at query time.

Political problem in govt. Vendors on board, hard to pursue other technologies.

Object systems. OMG working on it? (OQL?). Java has ugly layer that maps into relational system.

Franz has SPARQL built on top of a graph store.
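
To make the measurement dimensionality point concrete, here is a small runtime analogue in Python. It is a sketch under stated assumptions, not a proposal for SQL syntax: the class Quantity and the three-exponent encoding (length, mass, time) are invented for illustration, and the check happens when the expression runs rather than when a query is compiled.

```python
from dataclasses import dataclass

# Dimension vectors: exponents of (length, mass, time).
LENGTH = (1, 0, 0)
MASS = (0, 1, 0)
TIME = (0, 0, 1)


@dataclass(frozen=True)
class Quantity:
    value: float
    dims: tuple

    def __add__(self, other):
        # Addition is only meaningful between like dimensions.
        if self.dims != other.dims:
            raise TypeError(f"cannot add dimensions {self.dims} and {other.dims}")
        return Quantity(self.value + other.value, self.dims)

    def __mul__(self, other):
        # Multiplying quantities adds their dimension exponents.
        dims = tuple(a + b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value * other.value, dims)

    def __truediv__(self, other):
        # Dividing quantities subtracts their dimension exponents.
        dims = tuple(a - b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value / other.value, dims)


distance = Quantity(100.0, LENGTH)
duration = Quantity(20.0, TIME)
speed = distance / duration      # dims (1, 0, -1), i.e. length per time
print(speed)

try:
    distance + duration          # length + time is rejected
except TypeError as exc:
    print("rejected:", exc)
```

A dimensionally typed query language would reject the second expression before execution, in the same way that it rejects adding a date to a string today.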


Input from WG 2 and XMDR

Link Mining Applications: Progress and Challenges - Ted E. Senator

Link mining is a fairly new research area that lies at the intersection of link analysis, hypertext and web mining, relational learning and inductive logic programming, and graph mining. However, and perhaps more important, it also represents an important and essential set of techniques for constructing useful applications of data mining in a wide variety of real and important domains, especially those involving complex event detection from highly structured data. Imagine a complete “link mining toolkit.” What would such a toolkit look like?


Input from WG 2 and XMDR

Link Mining Applications: Progress and Challenges - Ted E. Senator

Most important, it would require a language that enabled the natural representation of entities and links. Such a language would also allow for the representation of pattern templates and for specifying matches between the templates and their instantiations.

The language would have to accept an arbitrary database schema as input, with a specified mapping between relations in the database and fundamental link types in the language.

It would have to compile into efficient and rapidly executable database queries.

It would need to be able to represent grouped entities and multiple abstraction hierarchies and reason at all levels.

It would have to enable the creation of new schema elements in the database to represent newly discovered concepts.


Input from WG 2 and XMDR

Link Mining Applications: Progress and Challenges - Ted E. Senator

It would need to represent both pattern templates and pattern instances, and to have a mechanism for tracking matches between the two.

It would have to have constructs for representing fundamental relationships such as part-of, is-a, and connected-to (the most generic link relationship), as well as perhaps other high-level link types such as temporal relationships (e.g., before, after, during, overlapping, etc.), geo-spatial relationships, organizational relationships, trust relationships, and activities and events.

The toolkit would include at least one and possibly many pattern matchers.

It would require tools for creating and editing patterns.

It would have to include visualizations for many different types of structured data.

It would need mechanisms for handling uncertainty and confidence.

It would have to track the dependence of any conclusion (e.g., pattern match or discovered pattern) back to the underlying data, and perhaps incorporate backtracking so the impact of data corrections could be detected.


Input from WG 2 and XMDR

Link Mining Applications: Progress and Challenges - Ted E. Senator

It would need configuration management tools to track the history of discovered and matched patterns.

It would need workflow mechanisms to support multiple users in an organizational structure.

It would need mechanisms for ingesting domain-specific knowledge.

It would have to be able to deal with multiple data types including text and imagery.

And it would have to be able to rapidly incorporate new link mining techniques as they are developed.

Finally, it would need to include mechanisms for maximum privacy protection.


Where to Progress Semantics Management?

SC 32 in WG 2 and WG 3 as extensions to ongoing work or as New Work Items

W3C as XQuery, SPARQL, Semantic Web Deployment WG (RDF vocabularies, SKOS)

OMG as extensions to the MOF…


Thanks & Acknowledgements

John McCarthy
Karlo Berket
Kevin Keck
Frank Olken
Harold Solbrig

L8 and SC 32/WG 2 Standards Committees

Major XMDR Project Sponsors and Collaborators
  U.S. Environmental Protection Agency
  Department of Defense
  National Cancer Institute
  U.S. Geological Survey
  Mayo Clinic
  Apelon