04/18/23

Henrico Dolfing

Seminar Digital Information Curation

MONDRIAN: Annotating and querying databases through colors and blocks

Digital Information Curation 2 04/18/23

Outline

Introduction
Colors and Blocks
Color Algebra
Mondrian System
Discussion
References

Introduction

Geerts, F., Kementsietsidis, A., Milano, D., “MONDRIAN: Annotating and querying databases through colors and blocks”, accepted for ICDE 2006

Annotation-oriented data model for manipulating and querying both data and annotations.

MONDRIAN, a prototype implementation of the annotation mechanism


Motivation

Scientific databases
  Huge amounts of data
  Different formats (flat text, images, XML, ...)

Challenges
  Integrate, annotate and cross-reference such diverse collections of data
  Maintain data provenance

Pressing needs of biological databases

Use Case (1/2)

GDB, a human genome database

gid     gname  chr
120231  NF1    17
120232  NF2    22
120233  NGFB   1
120234  NGFR   17
120235  NHS    21

SwissProt, a protein database

pid     pname
A01399  Nerve growth factor
A25218  Tumor necrosis factor
A45770  Merlin
I78852  Neurofibromatosis
Q6T45   Nance-Horan syndrome

Use Case (2/2)

PIR, a protein sequence database

sid     sname                origin
P01138  Nerve growth factor  Human
P08138  TNR16                Human
P14543  Nidogen              Human
P21359  Neurofibromin        Human
P35240  Merlin               Human

SwissProt & PIR → UniProt

Colors and Blocks (1/2)

PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138
A25218  120234  P08138

[Figure: colored blocks drawn over the attribute values of this relation. Block labels in the figure: John; Mary; Peter; Mary; John; John,Mary]

Colors and Blocks (2/2)

Block = annotated group of attribute values

Color = each annotation is represented by a color

Block overlapping
Inheritance
Transitivity

Color queries = queries on annotated databases, written in a “Color Algebra”
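The definitions above can be made concrete with a small data model. The following is a minimal sketch, not the MONDRIAN implementation: it assumes that a block groups attribute values within a single tuple, and the names `Block`, `ColoredTuple` and `colors_of` are hypothetical. The tuple and its two blocks follow the relational representation shown later in the talk.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Block:
    """An annotated group of attribute values inside one tuple (assumption)."""
    color: str             # the annotation, e.g. the annotator's name
    attributes: frozenset  # names of the attributes covered by the block


@dataclass
class ColoredTuple:
    """A relational tuple together with the blocks placed on it."""
    values: dict                                # attribute name -> value
    blocks: list = field(default_factory=list)  # list of Block

    def colors_of(self, attribute):
        """Return the colors of all blocks that cover the given attribute."""
        return {b.color for b in self.blocks if attribute in b.attributes}


# Annotations as in the relational representation slide:
# John annotates {pid, gid}, Mary annotates {gid, sid}.
# The gid is written as 120231, as in the GDB table.
t = ColoredTuple(
    values={"pid": "I78852", "gid": "120231", "sid": "P21359"},
    blocks=[
        Block("John", frozenset({"pid", "gid"})),
        Block("Mary", frozenset({"gid", "sid"})),
    ],
)

print(sorted(t.colors_of("gid")))  # ['John', 'Mary'] (overlapping blocks)
print(sorted(t.colors_of("pid")))  # ['John']
```

In this sketch, overlapping blocks simply show up as an attribute value that carries more than one color.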


Color Algebra (1/2)

Projection
Selection
Cartesian product
Block selection
Block projections
Merge
Recoloring
Renaming
Union

Color Algebra (2/2)

Definition: The color algebra consists of all expressions obtained by composing a finite number of the operators.

Theorem: The set of operators in the color algebra is minimal


Projection

Input:

PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138
A25218  120234  P08138

Result:

PID     GID
I78825  120231
A45770  120232
A01399  120233
A25218  120234

L-Type Block Projection

Input:

PID     GID
I78825  120231
A45770  120232
A01399  120233
A25218  120234

Result:

PID     GID
I78825  120231
A45770  120232
A25218  120234

U-Type Block Projection

PID     GID
I78825  120231
A45770  120232
A01399  120233
A25218  120234

Combined Block Projection

Input:

PID     GID
I78825  120231
A45770  120232
A01399  120233
A25218  120234

Result:

PID     GID
I78825  120231
A45770  120232
A01399  120233

Query example

Consider the original relation from our use case.

Assume we want to find all tuples that have a block annotated by Mary, or that concern the protein with sid P08138.

Assume we are only interested in keeping the {gid, sid} attributes of these tuples.
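The query can be sketched in plain Python as a composition of block selection, selection, union and projection. This is an illustration under assumptions, not the MONDRIAN operators themselves: the function names are hypothetical, the placement of blocks on the tuples is invented for the example (the slides only indicate that the first three tuples carry a block colored Mary), and projection is assumed to clip each block to the attributes that are kept.

```python
# Each row is (values, blocks); a block is (color, set of attribute names).
# The block placement below is illustrative only.
relation = [
    ({"pid": "I78825", "gid": "120231", "sid": "P21359"},
     [("John", {"pid", "gid"}), ("Mary", {"gid", "sid"})]),
    ({"pid": "A45770", "gid": "120232", "sid": "P35240"},
     [("Mary", {"sid"})]),
    ({"pid": "A01399", "gid": "120233", "sid": "P01138"},
     [("Peter", {"pid"}), ("Mary", {"gid"})]),
    ({"pid": "A25218", "gid": "120234", "sid": "P08138"},
     []),
]


def block_select(rows, color):
    """Keep the tuples that carry at least one block of the given color."""
    return [(v, b) for v, b in rows if any(c == color for c, _ in b)]


def select(rows, attribute, value):
    """Classical selection on an attribute value."""
    return [(v, b) for v, b in rows if v[attribute] == value]


def union(left, right):
    """Union that keeps each tuple once (tuples compared by their values)."""
    result = list(left)
    seen = [v for v, _ in left]
    for v, b in right:
        if v not in seen:
            result.append((v, b))
            seen.append(v)
    return result


def project(rows, attributes):
    """Keep the listed attributes and clip blocks to them (assumption)."""
    out = []
    for v, blocks in rows:
        kept = {a: v[a] for a in attributes}
        clipped = [(c, attrs & set(attributes)) for c, attrs in blocks]
        # Blocks that no longer cover any kept attribute are dropped.
        out.append((kept, [(c, a) for c, a in clipped if a]))
    return out


annotated_by_mary = block_select(relation, "Mary")
about_p08138 = select(relation, "sid", "P08138")
answer = project(union(annotated_by_mary, about_p08138), ["gid", "sid"])

for values, blocks in answer:
    print(values["gid"], values["sid"], sorted(c for c, _ in blocks))
# 120231 P21359 ['John', 'Mary']
# 120232 P35240 ['Mary']
# 120233 P01138 ['Mary']
# 120234 P08138 []
```

The following slides walk through the same steps one operator at a time.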


Block Selection

PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138
A25218  120234  P08138

[Figure: block labels John; Mary; Peter; Mary; John; John,Mary]

Block Selection

PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138

[Figure: block labels Mary; Mary; John,Mary]

Selection

PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138
A25218  120234  P08138

Selection

PID     GID     SID
A25218  120234  P08138

Union

PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138

[Figure: block labels Mary; Mary; John,Mary]

PID     GID     SID
A25218  120234  P08138

Union

PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138
A25218  120234  P08138

[Figure: block labels Mary; Mary; John,Mary]

Projection

PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138
A25218  120234  P08138

[Figure: block labels Mary; Mary; John,Mary]

Projection

GID     SID
120231  P21359
120232  P35240
120233  P01138
120234  P08138

[Figure: block labels Mary; Mary; John,Mary]

Cartesian Product

PID     GID
I78825  120231
A45770  120232
A25218  120234

GID     SID
120231  P21359
120232  P35240
120233  P01138
120234  P08138

Cartesian Product

PID     GID     GID’    SID’
I78825  120231  120231  P21359
A45770  120232  120232  P35240
A25218  120234  120234  P08138

Merge

Projecting out GID’

PID     GID     SID’
I78825  120231  P21359
A45770  120232  P35240
A25218  120234  P08138

Merge

Projecting out GID

PID     GID’    SID’
I78825  120231  P21359
A45770  120232  P35240
A25218  120234  P08138

Merge

PID     GID’    SID’
I78825  120231  P21359
A45770  120232  P35240
A25218  120234  P08138

Mondrian System

Piet Mondria(a)n: Dutch painter whose paintings mainly consist of color blocks

Victory Boogie Woogie (€ 40.000.000)


Desirable properties

No restructuring of the existing database schema
  Only extra tables need to be added

Minimum overhead in terms of
  Space
  Query execution time

Annotations should be treated as first-class citizens of the database, i.e. it should be possible to query them

Current state of Mondrian System

Text-based CA query or graphical CA query → equivalent CRA query → equivalent SQL query → MySQL relational DBMS → result

Relational Representation

Assume assoc(pid, bpid), assoc(gid, bgid) and assoc(sid, bsid)

Data is separated from annotation representation

pid     gid     sid     bpid  bgid  bsid  color
I78852  120231  P21359  0     0     0     C
I78852  120231  P21359  1     1     0     John
I78852  120231  P21359  0     1     1     Mary
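Reading the table above, each row appears to pair one tuple with one color, and the flags bpid, bgid and bsid mark which attribute values belong to the block of that color. Under that reading (an assumption: the slide does not spell out the encoding, and the meaning of the all-zero row with color C is not stated), the blocks can be recovered as sketched below. The helper name `decode_blocks` is hypothetical.

```python
# Rows follow the slide: (pid, gid, sid, bpid, bgid, bsid, color).
# The gid is written as 120231, as in the GDB table.
rows = [
    ("I78852", "120231", "P21359", 0, 0, 0, "C"),
    ("I78852", "120231", "P21359", 1, 1, 0, "John"),
    ("I78852", "120231", "P21359", 0, 1, 1, "Mary"),
]

ATTRIBUTES = ("pid", "gid", "sid")


def decode_blocks(rows):
    """Map each color to the attribute values its flags mark as annotated."""
    blocks = {}
    for row in rows:
        values, flags, color = row[:3], row[3:6], row[6]
        covered = {a: v for a, v, f in zip(ATTRIBUTES, values, flags) if f == 1}
        if covered:  # rows whose flags are all 0 contribute no block
            blocks[color] = covered
    return blocks


for color, covered in decode_blocks(rows).items():
    print(color, covered)
# John {'pid': 'I78852', 'gid': '120231'}
# Mary {'gid': '120231', 'sid': 'P21359'}
```

Because the flags live in extra columns next to the data, the original attributes pid, gid and sid keep their names and types, which fits the requirement that the existing schema is not restructured.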


Current state

Text-based CA query or graphical CA query → equivalent CRA query → equivalent SQL query → MySQL relational DBMS → result
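To illustrate the last steps of this pipeline, the sketch below stores the relational representation in a database and answers a block selection with an ordinary SQL query. It rests on several assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, the table name `r`, the column name `color` and the shape of the query are hypothetical, and the SQL that MONDRIAN actually generates is not shown in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE r (pid TEXT, gid TEXT, sid TEXT, "
    "bpid INTEGER, bgid INTEGER, bsid INTEGER, color TEXT)"
)
conn.executemany(
    "INSERT INTO r VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("I78852", "120231", "P21359", 0, 0, 0, "C"),
        ("I78852", "120231", "P21359", 1, 1, 0, "John"),
        ("I78852", "120231", "P21359", 0, 1, 1, "Mary"),
    ],
)


def block_selection_sql():
    """SQL for: tuples that have at least one block of a given color."""
    return (
        "SELECT DISTINCT pid, gid, sid FROM r "
        "WHERE color = ? AND (bpid = 1 OR bgid = 1 OR bsid = 1)"
    )


mary = conn.execute(block_selection_sql(), ("Mary",)).fetchall()
peter = conn.execute(block_selection_sql(), ("Peter",)).fetchall()
print(mary)   # [('I78852', '120231', 'P21359')]
print(peter)  # []
conn.close()
```

The color is passed as a query parameter rather than pasted into the SQL string, so annotation values containing quotes do not break the statement.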


Experimental Results


Discussion


Literature

[Geerts et al., 2005] Geerts, F., Kementsietsidis, A., and Milano, D., “MONDRIAN: Annotating and querying databases through colors and blocks”, accepted for ICDE 2006, 2005

[Buneman et al., 2005] Buneman, P., Bose, R., Ecklund, D., “Annotation in Scientific Data: a Scoping Report”, 2005

[Gray et al., 2002] Gray, J., Szalay, A.S., Thakar, A.R., Stoughton, C., van den Berg, J., “Online Scientific Data Curation, Publication, and Archiving”, Technical Report MSR-TR-2002-74, Microsoft Research, 2002