6/27/2015 Henrico Dolfing Seminar Digital Information Curation MONDRIAN: Annotating and querying...
04/18/23
Henrico Dolfing
Seminar Digital Information Curation
MONDRIAN: Annotating and querying databases through colors and blocks
Digital Information Curation 2 04/18/23
Outline
Introduction
Colors and Blocks
Color Algebra
Mondrian System
Discussion
References
Introduction
Geerts, F., Kementsietsidis, A., Milano, D., “MONDRIAN: Annotating and querying databases through colors and blocks”, accepted for ICDE 2006
Annotation-oriented data model for manipulating and querying both data and annotations.
MONDRIAN, a prototype implementation of the annotation mechanism
Motivation
Scientific databases: huge amounts of data in different formats (flat text, images, XML, ...)
Challenges: integrate, annotate and cross-reference such diverse collections of data; maintain data provenance
Pressing needs of biological databases
Use Case (1/2)
GDB, a human genome database

gid     gname  chr
120231  NF1    17
120232  NF2    22
120233  NGFB   1
120234  NGFR   17
120235  NHS    21

SwissProt, a protein database

pid     pname
A01399  Nerve growth factor
A25218  Tumor necrosis factor
A45770  Merlin
I78852  Neurofibromatis
Q6T45   Nancy-Horan syndrome
Use Case (2/2)
PIR, a protein sequence database
SwissProt & PIR -> UniProt

sid     sname                origin
P01138  Nerve growth factor  Human
P08138  TNR16                Human
P14543  Nidogen              Human
P21359  Neurofibromin        Human
P35240  Merlin               Human
Colors and Blocks (1/2)
PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138
A25218  120234  P08138

[Figure: colored blocks over this relation, labeled John, Mary, Peter, Mary, John and John,Mary]
Colors and Blocks (2/2)
Block = annotated group of attribute values
Color = each annotation is represented by a color
Block overlapping, inheritance, transitivity
Color queries = queries on annotated databases, written in a “Color Algebra”
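A minimal Python sketch of these two notions, under stated assumptions: a block is modeled as a set of attributes inside one tuple plus a color, which is what the bit flags on the "Relational Representation" slide suggest. The names Block, overlap and colors_at are hypothetical, and the two example blocks (John over pid and gid, Mary over gid and sid) are taken from that same slide. Inheritance and transitivity are not modeled, since the slides do not define them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    row: int          # index of the tuple the block lives in (assumption)
    attrs: frozenset  # attribute names covered by the block
    color: str        # the annotation, here the annotator's name


def overlap(a, b):
    """Attributes shared by two blocks of the same tuple (empty if none)."""
    if a.row != b.row:
        return frozenset()
    return a.attrs & b.attrs


def colors_at(blocks, row, attr):
    """All colors whose blocks cover a single attribute value."""
    return sorted(b.color for b in blocks if b.row == row and attr in b.attrs)


john = Block(0, frozenset({"pid", "gid"}), "John")
mary = Block(0, frozenset({"gid", "sid"}), "Mary")

print(overlap(john, mary))                  # attributes covered by both blocks
print(colors_at([john, mary], 0, "gid"))    # colors attached to the gid value
```

In this sketch the two blocks overlap on gid, so that value carries both colors, which is one plausible reading of the "John,Mary" label in the figure.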
Color Algebra (1/2)
Projection
Selection
Cartesian product
Block selection
Block projections
Merge
Recoloring
Renaming
Union
Color Algebra (2/2)
Definition: The color algebra consists of all expressions obtained by composing a finite number of the operators.
Theorem: The set of operators in the color algebra is minimal
Projection
Input relation:

PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138
A25218  120234  P08138

Result:

PID     GID
I78825  120231
A45770  120232
A01399  120233
A25218  120234
L-Type Block Projection
Input relation:

PID     GID
I78825  120231
A45770  120232
A01399  120233
A25218  120234

Result:

PID     GID
I78825  120231
A45770  120232
A25218  120234
U-Type Block Projection
PID     GID
I78825  120231
A45770  120232
A01399  120233
A25218  120234
Combined Block Projection
Input relation:

PID     GID
I78825  120231
A45770  120232
A01399  120233
A25218  120234

Result:

PID     GID
I78825  120231
A45770  120232
A01399  120233
Query example
Consider the original relation in our use case.
Assume we want to find all the tuples that have a block annotated by Mary, or that concern the protein with sid P08138.
Assume we are only interested in keeping the {gid,sid} attributes of these tuples.
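The query above can be sketched in Python as a block selection on the color Mary, a selection on sid, a union of the two results, and a projection onto {gid,sid}. This is an illustration under assumptions, not the Mondrian implementation: the function names are hypothetical, the block placements are invented except for the first tuple (John over pid and gid, Mary over gid and sid, as on the "Relational Representation" slide), and the projection simply clips each block to the kept attributes, which the slides do not confirm.

```python
# Relation from the use case; row order follows the slides.
ROWS = [
    {"pid": "I78825", "gid": "120231", "sid": "P21359"},
    {"pid": "A45770", "gid": "120232", "sid": "P35240"},
    {"pid": "A01399", "gid": "120233", "sid": "P01138"},
    {"pid": "A25218", "gid": "120234", "sid": "P08138"},
]

# (row index, color, attributes covered). Placements are hypothetical.
BLOCKS = [
    (0, "John", frozenset({"pid", "gid"})),
    (0, "Mary", frozenset({"gid", "sid"})),
    (1, "Mary", frozenset({"gid"})),
    (2, "Peter", frozenset({"pid"})),
    (2, "Mary", frozenset({"sid"})),
    (3, "John", frozenset({"pid"})),
]


def block_select(color):
    """Indices of rows that carry at least one block of the given color."""
    return {i for i, c, _ in BLOCKS if c == color}


def select(predicate):
    """Indices of rows whose values satisfy the predicate."""
    return {i for i, row in enumerate(ROWS) if predicate(row)}


def project(indices, attrs):
    """Keep only attrs; a color survives if its block touches a kept attribute."""
    attrs = frozenset(attrs)
    result = []
    for i in sorted(indices):
        values = {a: ROWS[i][a] for a in sorted(attrs)}
        colors = sorted({c for j, c, covered in BLOCKS if j == i and covered & attrs})
        result.append((values, colors))
    return result


mary = block_select("Mary")                           # tuples with a Mary block
p08138 = select(lambda row: row["sid"] == "P08138")   # tuple for sid P08138
answer = project(mary | p08138, {"gid", "sid"})       # union, then projection

for values, colors in answer:
    print(values, colors)
```

With these placements the block selection keeps the first three tuples, the selection contributes the fourth, and after projection the first tuple still carries both John and Mary while the fourth carries no color.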
Block Selection
PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138
A25218  120234  P08138

[Figure: colored blocks over this relation, labeled John, Mary, Peter, Mary, John and John,Mary]
Block Selection
PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138

[Figure: colored blocks over this relation, labeled Mary, Mary and John,Mary]
Selection
PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138
A25218  120234  P08138
Union
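First operand:

PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138

[Figure: colored blocks over this relation, labeled Mary, Mary and John,Mary]

Second operand:

PID     GID     SID
A25218  120234  P08138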
Union
PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138
A25218  120234  P08138

[Figure: colored blocks over this relation, labeled Mary, Mary and John,Mary]
Projection
PID     GID     SID
I78825  120231  P21359
A45770  120232  P35240
A01399  120233  P01138
A25218  120234  P08138

[Figure: colored blocks over this relation, labeled Mary, Mary and John,Mary]
Projection
GID     SID
120231  P21359
120232  P35240
120233  P01138
120234  P08138

[Figure: colored blocks over this relation, labeled Mary, Mary and John,Mary]
Cartesian Product
First operand:

PID     GID
I78825  120231
A45770  120232
A25218  120234

Second operand:

GID     SID
120231  P21359
120232  P35240
120233  P01138
120234  P08138
Cartesian Product
PID     GID     GID’    SID’
I78825  120231  120231  P21359
A45770  120232  120232  P35240
A25218  120234  120234  P08138
Merge
Projecting out GID’

PID     GID     SID’
I78825  120231  P21359
A45770  120232  P35240
A25218  120234  P08138
Merge
Projecting out GID

PID     GID’    SID’
I78825  120231  P21359
A45770  120232  P35240
A25218  120234  P08138
Merge

PID     GID’    SID’
I78825  120231  P21359
A45770  120232  P35240
A25218  120234  P08138
Mondrian System
Piet Mondria(a)n: Dutch painter whose paintings mainly consist of color blocks
Victory Boogie Woogie (€ 40.000.000)
Desirable properties
No restructuring of the existing database schema: only extra tables need to be added
Minimum overhead in terms of space and query execution time
Annotations should be treated as first-class citizens of the database, i.e. it should be possible to query them
Current state of Mondrian System
Text-based CA query or graphical CA query -> equivalent CRA query -> equivalent SQL query -> MySQL relational DBMS -> result
Relational Representation
Assume assoc(pid,bpid), assoc(gid,bgid) and assoc(sid,bsid)
Data is separated from annotation representation
pid     gid    sid     bpid  bgid  bsid  color
I78852  12031  P21359  0     0     0     C
I78852  12031  P21359  1     1     0     John
I78852  12031  P21359  0     1     1     Mary
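To make the bit-flag encoding concrete, here is a small Python sketch that loads the John and Mary rows into an in-memory SQLite table and queries the annotations with plain SQL. Several things are assumptions: SQLite stands in for MySQL, the table name annotated and the column name color are hypothetical, the queries are hand-written rather than generated by Mondrian, and the all-zero row is left out because its meaning is not clear from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE annotated ("
    "pid TEXT, gid TEXT, sid TEXT, "
    "bpid INTEGER, bgid INTEGER, bsid INTEGER, color TEXT)"
)

# One row per (tuple, color); each b* flag says whether the block covers that attribute.
rows = [
    ("I78852", "12031", "P21359", 1, 1, 0, "John"),
    ("I78852", "12031", "P21359", 0, 1, 1, "Mary"),
]
conn.executemany("INSERT INTO annotated VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Tuples in which Mary annotated the sid value.
mary_sid = conn.execute(
    "SELECT DISTINCT pid, gid, sid FROM annotated WHERE color = ? AND bsid = 1",
    ("Mary",),
).fetchall()

# All colors attached to the gid value.
gid_colors = [
    color
    for (color,) in conn.execute(
        "SELECT color FROM annotated WHERE bgid = 1 ORDER BY color"
    )
]

print(mary_sid)
print(gid_colors)
```

Because the flags are ordinary columns, annotations can be filtered and joined like any other data, which is the sense in which they become queryable.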
Literature
[Geerts et al., 2005] Geerts, F., Kementsietsidis, A., Milano, D., “MONDRIAN: Annotating and querying databases through colors and blocks”, accepted for ICDE 2006, 2005
[Buneman et al., 2005] Buneman, P., Bose, R., Ecklund, D., “Annotation in Scientific Data: a Scoping Report”, 2005
[Gray et al., 2002] Gray, J., Szalay, A.S., Thakar, A.R., Stoughton, C., van den Berg, J., “Online Scientific Data Curation, Publication, and Archiving”, Technical Report MSR-TR-2002-74, Microsoft Research, 2002