LDV: Light-weight Database Virtualization · application virtualization (LDV) approach we present...

LDV: Light-weight Database VirtualizationQuan Pham2, Tanu Malik1, Boris Glavic3 and Ian Foster1,2

Computation Institute1and Department of Computer Science2,3

University of Chicago1,2, Argonne National Laboratory1

Illinois Institute of Technology3

Application Virtualization

Alice’s Machine Bob’s Machine&RGH�'DWD�(QYLURQPHQW

$XGLWƔ ,QYRNH�DORQJVLGH�WKH�FRPSXWDWLRQ

FGH�S\WKRQ�ZHDWKHUBVLP�S\�WRN\R�GDW

Ɣ &DSWXUHV�V\VWHP�FDOOV�XVLQJSWUDFHż ([HFXWH�DQG�FRS\�LQWR�FGH�SDFNDJH�FGH�URRW

Application Virtualization for DB Applications

Application

Operating System

File System File System Slice

Pkg

Copy AV

Alice'sComputer

chdir(“/usr”)open(“lib/libc.so.6”)DB Server

Application Virtualization for DB Applications

• Applications that interact with a relational database

• Examples:

• Text-mining applications that download data, preprocess and insert into a personal DB

• Analysis scripts using parts of a hosted database

Application

Operating System

File System File System Slice

Pkg

Copy AV

Alice'sComputer

chdir(“/usr”)open(“lib/libc.so.6”)DB Server

Why doesn’t it work?• Application virtualization methods are

oblivious to semantics of data in a database system

• The database state at the time of sharing the application may not be the same as the start of the application

• Databases are often shared among multiple users andacross many application. Thus, to re-execute an applica-tion, the database state, as of the start of the application,has to be restored.

• Provenance can be used to understand a shared applica-tion. While database and application provenance are wellunderstood, combining these two types of provenanceremains challenging.

None of the above mentioned methods - companion web-sites, VM images, and application virtualization - addressesthese challenges. There is no automatic mechanism for cap-turing and linking application and DB provenance, theseapproaches provide no means for determining which data isrelevant for an application, they do not solve the issue ofresetting a database to a previous state, and do not addressthe licensing problem of sharing the binaries of commercialdatabase servers.

For example, application virtualization is currently limitedto local applications that do not communicate to server pro-cesses, such as a web server or a database server. In fact,when an application communicates with a database server,the technique can atmost record the communication betweenthe client and database server. This is not sufficient fordetermining which data was used by the application (and, thus,should be included in the package) and to be able to resetthe database to its state before application execution started.Temporal databases provide a solution for the later problem,but not for the earlier. Virtualization can ensure reproducibilityif the user has control over the database server, the serveris started as part of the application (thus the virtualizationsystem can capture a consistent state of the database files ondisk and the server binaries) and shutdown before the capturemechanism is stopped. However, this will include completedatabase into the resulting package.

The goal of this work is to improve computational repro-ducibility for database applications. The light-weight databaseapplication virtualization (LDV) approach we present in thiswork addresses the aforementioned challenges. In particular,LDV enables users to easily create a light-weight databaseapplication virtualization (LDV) package, consisting of code,data, software dependencies, a slice of the database with whichis required for re-execution, and provenance. If shared witha 2nd party, the application runs in exactly the same wayas it did for the original user, without requiring installationor configuration of a database server at the target site. Theprovenance included in a package can used to understanddata dependencies across the application and database, and todetermine which parts of a workflow are needed to re-createa partial result.

II. LIGHT-WEIGHT DATABASE VIRTUALIZATION

We describe our approach by means of an example repro-ducibility task. Consider a user Alice who has been using adatabase in the past to conduct her experiments. She has finallydeveloped a database application which reads some input dataand outputs some analysis that she believes is interesting to

share with Bob (Figure 1). Alice would preferably like to sharethis application in the form of package P with Bob, who maywant to re-execute the application in its entirety or may want tovalidate, just the analysis task, or provide his own data inputsto examine the analysis result.

If Alice wants Bob to re-execute and build upon herdatabase application, then Bob must have access to an en-vironment that consists of application binaries and data, anyextension modules that the code depends upon (e.g., dynam-ically linked libraries), a database server and a database onwhich the application can be re-executed. Ideally, it wouldbe useful if Alice’s environment can be virtualized and thusautomatically set up for Bob.

P3 P4Other experiments

f1

P1 Insert

t1

t2

t3

Query

P2

t4

f2

Alice’sexperiment

Database

Fig. 1: Alice’s experiment with processes P1 and P2 uses tuplet1, inserts tuple t3, creates final output f2. Dumping databaseproduces redundant tuple t2. Capturing Alice’s experiment inits fullness makes t3 redundant. Only t1 is needed for theexperiment to execute.

If we assume that Alice’s application consists of set ofmodules that read data from files and/or retrieve data froma database, and write data to files and/or write data to adatabase, the database server is accessed through standardSQL language commands, and Alice executes her applicationthrough a command line script, providing a single entry pointfor monitoring the application, then several questions arisewith respect to virtualizing her environment. In particular:

•How do we include the necessary and sufficient data, i.e.,data that corresponds to her last experiment in the virtualizedenvironment? As Figure 1 shows there are data (tuples) in thedatabase that are not part of the current experiment and ifincluded in the package, may increase the size considerably,not leading to a light-weight virtualized environment.

•How can a self-contained package be created so that Bobdoes not have to install or configure a database server?

•How can Bob re-execute the database application, par-tially, or wholly, without communicating with Alice’s databaseserver?

We describe our primary contributions in addressing thesequestions, and also describe an overall organization map forthe paper. In summary we monitor database applications andcombine database and application provenance to determinenecessary and sufficient data. We describe how this data can be

LDV: Light-weight Database Virtualization

• Goal: Easily and efficiently share and repeat DB applications.

Key Ideas

• DB application = Application (OS) part + DB part

• Use data provenance to capture interactions from/to the application side to the database side

• Limited formal mechanisms so far to combine the two kinds of provenance models

• Create a virtualized package that can be re-executed

• Either include the server and data, or replay interactions (for licensed databases)

• No virtualization mechanism for database replay

Related Work

• Application virtualization

• Linux Containers, CDE[Usenix’11]

• Packaging with annotations

• Docker

• Packaging with provenance

• PTU1[TaPP’13], ReproZip[TaPP’13], Research Objects

• Unified provenance models

• based on program instrumentation [TaPP’12]

1 Q. Pham, T. Malik, and I. Foster. Using provenance for repeatability. In Theory and Practice of Provenance (TaPP), 2013.

How does LDV work?

Application

Operating System

File System

DB Server

Execution Trace

DB Server DB Slice

File System Slice

Pkg

Copy LDV

Alice'sComputer

Alice’s Machine

ldv-audit db-app

• Monitoring system calls

• Monitoring SQL

• Server-included packages

• Server-excluded packages

• Execution traces

• Relevant DB and filesystem slices

• Redirecting file access

• Redirecting DB access


• Server-excluded packagesFile System

Bob'sComputerUser Application

Operating System

DB Server

Execution Trace

DB Server DB Slice

File System Slice

Pkg

LDV Redirect

Bob’s Machine

ldv-exec db-app

How does LDV work?

Example

Alice:~$ ldv-audit app.shApplication package created as app-pkgAlice:~$ lsapp-pkg app.sh src dataAlice:~$echo "Hi Bob, Please find the pkg --Alice" \ | mutt -s "Sharing DB Application -a "./app-pkg" \ -- [email protected]

Bob:~$ ls .app-pkgBob:~$ cd app-pkgBob:~$ lsapp.sh src dataBob:~$ldv-exec app.shRunning app-pkg....

Ubuntu 14.04(Kernel 3.13)

+Postgres 9.1

CentOS 6.2(Kernel 2.6.32)

+MySQL

LDV Issues

• Monitoring system calls

• Monitoring SQL

• Execution traces

• Relevant DB slices

• Redirecting file access


• Server-excluded packages

• Redirecting DB access

Size Comparison

10

100

1000

10000

1-1 1-2 1-3 1-4 1-5 2-1 2-2 2-3 2-4 3-1 3-2 3-3 3-4 4-1 4-2 4-3 4-4 4-5

Pack

age

size

(MB)

Query

PTU packageServer-included packageServer-excluded package

Fig. 9: LDV packages are significantly smaller than PTUpackages when queries have low selectivity.

and the LDV packages. The VMI is 8.2 GB: 80 times largerthan the average LDV package (100MB). To evaluate runtimeperformance, we instantiate this VMI using the same numberof cores and memory as in our machine to execute our queries.Recall that Figure 8b shows that re-executing these queriesin a VM is slightly slower than a non-audited PostgreSQLexecution, and significantly slower than LDV packages.

X. CONCLUSIONS

We introduced a light-weight DB virtualization (LDV)system that can permit sharing and re-execution of appli-cations that perform DB operations. This system uses datacollected via application monitoring to create re-executablepackages that include an application, its dependencies (datafiles, relevant DB tuples), and a combined execution trace.Such packages can be used to repeat an application or part ofan application in a different environment.

Our LDV framework features an innovative integration ofdistinct OS and DB provenance models, and new methodsfor inferring data dependencies that cross model boundaries.The resulting system creates execution traces according to thisframework and uses these traces to determine which data needsto be included in a repeatability package. It leaves to the userthe choice of whether the package should include the DBMS.Our prototype implementation integrates the PTU (OS) andPerm (DB) provenance systems. In future work, we plan tointegrate with the DB-independent GProM [2] middleware.

XI. ACKNOWLEDGEMENTS

This material is based upon work supported by the NationalScience Foundation under Grants ICER-1343816 and SES-0951576,and by the US Department of Energy under contract DE-AC02-06CH11357. Any opinions, findings, and conclusions or recommen-dations expressed in this material are those of the authors and do notnecessarily reflect the views of the National Science Foundation.

REFERENCES

[1] Y. Amsterdamer et al. Putting Lipstick on Pig: Enabling Database-styleWorkflow Provenance. PVLDB, 5(4), 2011.

[2] B. Arab, B. Glavic, et al. A generic provenance middleware for databasequeries, updates, and transactions. In Proceedings of TaPP, 2014.

[3] N. Balakrishnan, T. Bytheway, et al. Opus: A lightweight system forobservational provenance in user space. In TaPP, 2013.

[4] C. T. Brown. Some myths of reproducible computational research. http://ivory.idyll.org/blog/2014-myths-of-computational-reproducibility.html.

[5] J. Cheney et al. Provenance in Databases: Why, How, and Where.Foundations and Trends in Databases, 1(4), 2009.

[6] F. Chirigati and J. Freire. Towards integrating workflow and databaseprovenance. In Provenance and Annotation of Data and Processes. 2012.

[7] F. S. Chirigati, D. Shasha, and J. Freire. Reprozip: Using provenanceto support computational reproducibility. In TaPP, 2013.

[8] S. C. Dey, S. Riddle, and B. Ludascher. Provenance analyzer: Exploringprovenance semantics with logic rules. In TaPP, 2013.

[9] J. Freire and C. T. Silva. Making computations and publicationsreproducible with vistrails. Computing in Science and Engineering,14(4), 2012.

[10] B. Glavic et al. Perm: Processing Provenance and Data on the sameData Model through Query Rewriting. In ICDE, 2009.

[11] B. Glavic et al. Using sql for efficient generation and querying ofprovenance information. In In search of elegance in the theory andpractice of computation: a Festschrift in honour of Peter Buneman. 2013.

[12] C. A. Goble and D. C. De Roure. myExperiment: social networkingfor workflow-using e-scientists. In Proceedings of the 2Nd Workshopon Workflows in Support of Large-scale Science, 2007.

[13] P. J. Guo et al. CDE: using system call interposition to automaticallycreate portable software packages. In USENIX Annual TechnicalConference, Portland, OR, 2011.

[14] B. Howe. Virtual appliances, cloud computing, and reproducibleresearch. Computing in Science & Engineering, 14(4):36–41, 2012.

[15] G. Karvounarakis and T. Green. Semiring-annotated data: Queries andprovenance. SIGMOD Record, 41(3):5–14, 2012.

[16] K. Keahey et al. Virtual workspaces for scientific applications. Journalof Physics: Conference Series, 78(1), 2007.

[17] N. Kwasnikowska, L. Moreau, and J. Van den Bussche. A formalaccount of the open provenance model. journal, 2010.

[18] S. Lampoudi. The path to virtual machine images as first classprovenance. Age, 2011.

[19] T. Malik, L. Nistor, and A. Gehani. Tracking and sketching distributeddata provenance. In International Conference on eScience, 2010.

[20] L. Moreau and P. Missier. Prov-dm: The prov data model.http://www.w3.org/TR/2013/REC-prov-dm-20130430/, 2013.

[21] K.-K. Muniswamy-Reddy, U. Braun, D. A. Holland, P. Macko,D. Maclean, D. W. Margo, M. I. Seltzer, and R. Smogor. Layeringin provenance systems., 2009.

[22] Q. Pham, T. Malik, and I. Foster. Using provenance for repeatability.In TaPP, 2013.

[23] Q. Pham, T. Malik, and I. Foster. Auditing and maintaining provenancein software packages. In IPAW, 2014.

[24] M. Stamatogiannakis et al. Looking inside the black-box: Capturingdata provenance using dynamic instrumentation. In TAPP, 2014.

[25] Transaction Processing Performance Council. TPC-H benchmark spec-ification. Published at http://www.tcp.org/hspec.html, 2008.

[26] M. Zhang et al. Tracing Lineage beyond Relational Operators. In VLDB,2007.

• LDV packages are significantly smaller than PTU packages when queries have low selectivity

• The VMI is 8.2 GB: 80 times larger than the average LDV package (100MB).

Audit and Replay

TABLE II: The 18 TPC-H benchmark queries used in our experiments

Queries SQL PARAM Sel.Q1-1 toQ1-5

SELECT l quantity, l partkey , l extendedprice , l shipdate , l receiptdate FROM lineitemWHERE l suppkey BETWEEN 1 AND PARAM

10, 20, 50, 100,250

1%, 2%, 5%,10%, 25%

Q2-1 toQ2-4

SELECT o comment, l comment FROM lineitem l, orders o, customer c WHERE l.l orderkey= o.o orderkey AND o.o custkey = c.c custkey AND c.c name LIKE ’%PARAM%’;

0000000, 000000,00000, 0000

66%, 6.6%,0.66%, 0.06%

Q3-1 toQ3-4

SELECT count(⇤) FROM lineitem l, orders o, customer c WHERE l.l orderkey = o.o orderkeyAND o.o custkey = c.c custkey AND c.c name LIKE ’%PARAM%’;

0000000, 000000,00000, 0000

66%, 6.6%,0.66%, 0.06%

Q4-1 toQ4-5

SELECT o orderkey, AVG(l quantity) AS avgQ FROM lineitem l, orders o WHERE l.l orderkey= o.o orderkey AND l suppkey BETWEEN 1 AND PARAM GROUP BY o orderkey;

10, 20, 50, 100,250

1%, 2%, 5%,10%, 25%

(a) Audit

1e-05

0.0001

0.001

0.01

0.1

1

10

100

1000

10000

Inserts FirstSelect

OtherSelects

Updates

Exec

utio

n tim

e (s

econ

ds)

PostgreSQL + PTUServer-included packageServer-excluded package

(b) Replay

1e-05

0.0001

0.001

0.01

0.1

1

10

100

1000

10000

Initialization Inserts FirstSelect

OtherSelects

Updates

Exec

utio

n tim

e (s

econ

ds)

PostgreSQL + PTU

1E-0

5

0.01

001

0.08

0.05

3

0.00

01

Server-included package

4.19

0.00

063

0.05

0.02

5

0.00

03

Server-excluded package

0.01

2E-0

5

0.01

0.00

9

0.00

01

Fig. 7: Execution time of each step in an application with query Q1-1

(a) Audit

0.001

0.01

0.1

1

10

100

1000

10000

1-1 1-2 1-3 1-4 1-5 2-1 2-2 2-3 2-4 3-1 3-2 3-3 3-4 4-1 4-2 4-3 4-4 4-5

Exec

utio

n tim

e (s

econ

ds)

Query


(b) Replay

0.001

0.01

0.1

1

10

100

1-1 1-2 1-3 1-4 1-5 2-1 2-2 2-3 2-4 3-1 3-2 3-3 3-4 4-1 4-2 4-3 4-4 4-5

Exec

utio

n tim

e (s

econ

ds)

Query


VM

Fig. 8: Execution time for each query, during audit (left) and replay (right)

TABLE III: Package Contents: PTU packages contain all datafiles of the full DB, whereas server-included LDV packagescontain the data files of an empty DB.

Package type Softwarebinaries

DBserver

Datafiles

DBprovenance

PTU 3 3 3(full) 7LDV server-included 3 3 3(empty) 3LDV server-excluded 3 7 7 3

server in the application execution, i.e., the server binaries anddata files. An LDV package contains DB provenance for re-execution, the DB server binaries, and an empty data directoryin the server-included scenario (Table III).

Figure 9 shows the sizes of the PTU, server-included, andserver-excluded packages constructed for the queries listedin Table II. Server-included LDV packages are significantlysmaller than PTU packages, because they contain only those

tuples needed to re-execute the application—which, for thesequeries, is at most ⇠25% of all tuples. Server-excluded LDVpackages are often yet smaller, because they contain only thequery results—which, for many of our experiment queries, aresmaller than the tuples required for re-execution. However,recall that server-excluded packages have less flexibility thando server-included packages.

F. Comparison with the Virtual Machine Approach

We compare a virtual machine image (VMI) with the server-included and server-excluded LDV approaches. The VMI iscreated based on a bare-bone Debian Wheezy 64bit VMI onwhich we install the DB server and the experiment binaries asin Section IX-A. We use “apt-get” to install a DB server, and“scp” to copy all DB files and source code for the experimentfrom our machine. Using the created VMI, we run the sameapplication to compare the size and performance of this VMI

LDV amortizes audit cost significantly at replay time

Summary• LDV permits sharing and repeating DB

applications

• LDV combines OS and DB provenance to determine file and DB slices

• LDV creates light-weight virtualized packages based on combined provenance

• Results show LDV is efficient, usable, and general

• LDV at http://github.com/lordpretzel/ldv.git

http://github.com/lordpretzel/ldv.git

LDV: Light-weight Database Virtualization · application virtualization (LDV) approach we present...

Documents

Transcript of LDV: Light-weight Database Virtualization · application virtualization (LDV) approach we present...