LDV: Light-weight Database Virtualization · application virtualization (LDV) approach we present...
Transcript of LDV: Light-weight Database Virtualization · application virtualization (LDV) approach we present...
LDV: Light-weight Database VirtualizationQuan Pham2, Tanu Malik1, Boris Glavic3 and Ian Foster1,2
Computation Institute1and Department of Computer Science2,3
University of Chicago1,2, Argonne National Laboratory1
Illinois Institute of Technology3
Application Virtualization
Alice’s Machine Bob’s Machine&RGH�'DWD�(QYLURQPHQW
$XGLWƔ ,QYRNH�DORQJVLGH�WKH�FRPSXWDWLRQ
FGH�S\WKRQ�ZHDWKHUBVLP�S\�WRN\R�GDW
Ɣ &DSWXUHV�V\VWHP�FDOOV�XVLQJSWUDFHż ([HFXWH�DQG�FRS\�LQWR�FGH�SDFNDJH�FGH�URRW
Application Virtualization for DB Applications
Application
Operating System
File System File System Slice
Pkg
Copy AV
Alice'sComputer
chdir(“/usr”)open(“lib/libc.so.6”)DB Server
Application Virtualization for DB Applications
• Applications that interact with a relational database
• Examples:
• Text-mining applications that download data, preprocess and insert into a personal DB
• Analysis scripts using parts of a hosted database
Application
Operating System
File System File System Slice
Pkg
Copy AV
Alice'sComputer
chdir(“/usr”)open(“lib/libc.so.6”)DB Server
Why doesn’t it work?• Application virtualization methods are
oblivious to semantics of data in a database system
• The database state at the time of sharing the application may not be the same as the start of the application
• Databases are often shared among multiple users andacross many application. Thus, to re-execute an applica-tion, the database state, as of the start of the application,has to be restored.
• Provenance can be used to understand a shared applica-tion. While database and application provenance are wellunderstood, combining these two types of provenanceremains challenging.
None of the above mentioned methods - companion web-sites, VM images, and application virtualization - addressesthese challenges. There is no automatic mechanism for cap-turing and linking application and DB provenance, theseapproaches provide no means for determining which data isrelevant for an application, they do not solve the issue ofresetting a database to a previous state, and do not addressthe licensing problem of sharing the binaries of commercialdatabase servers.
For example, application virtualization is currently limitedto local applications that do not communicate to server pro-cesses, such as a web server or a database server. In fact,when an application communicates with a database server,the technique can atmost record the communication betweenthe client and database server. This is not sufficient fordetermining which data was used by the application (and, thus,should be included in the package) and to be able to resetthe database to its state before application execution started.Temporal databases provide a solution for the later problem,but not for the earlier. Virtualization can ensure reproducibilityif the user has control over the database server, the serveris started as part of the application (thus the virtualizationsystem can capture a consistent state of the database files ondisk and the server binaries) and shutdown before the capturemechanism is stopped. However, this will include completedatabase into the resulting package.
The goal of this work is to improve computational repro-ducibility for database applications. The light-weight databaseapplication virtualization (LDV) approach we present in thiswork addresses the aforementioned challenges. In particular,LDV enables users to easily create a light-weight databaseapplication virtualization (LDV) package, consisting of code,data, software dependencies, a slice of the database with whichis required for re-execution, and provenance. If shared witha 2nd party, the application runs in exactly the same wayas it did for the original user, without requiring installationor configuration of a database server at the target site. Theprovenance included in a package can used to understanddata dependencies across the application and database, and todetermine which parts of a workflow are needed to re-createa partial result.
II. LIGHT-WEIGHT DATABASE VIRTUALIZATION
We describe our approach by means of an example repro-ducibility task. Consider a user Alice who has been using adatabase in the past to conduct her experiments. She has finallydeveloped a database application which reads some input dataand outputs some analysis that she believes is interesting to
share with Bob (Figure 1). Alice would preferably like to sharethis application in the form of package P with Bob, who maywant to re-execute the application in its entirety or may want tovalidate, just the analysis task, or provide his own data inputsto examine the analysis result.
If Alice wants Bob to re-execute and build upon herdatabase application, then Bob must have access to an en-vironment that consists of application binaries and data, anyextension modules that the code depends upon (e.g., dynam-ically linked libraries), a database server and a database onwhich the application can be re-executed. Ideally, it wouldbe useful if Alice’s environment can be virtualized and thusautomatically set up for Bob.
P3 P4Other experiments
f1
P1 Insert
t1
t2
t3
Query
P2
t4
f2
Alice’sexperiment
Database
Fig. 1: Alice’s experiment with processes P1 and P2 uses tuplet1, inserts tuple t3, creates final output f2. Dumping databaseproduces redundant tuple t2. Capturing Alice’s experiment inits fullness makes t3 redundant. Only t1 is needed for theexperiment to execute.
If we assume that Alice’s application consists of set ofmodules that read data from files and/or retrieve data froma database, and write data to files and/or write data to adatabase, the database server is accessed through standardSQL language commands, and Alice executes her applicationthrough a command line script, providing a single entry pointfor monitoring the application, then several questions arisewith respect to virtualizing her environment. In particular:
•How do we include the necessary and sufficient data, i.e.,data that corresponds to her last experiment in the virtualizedenvironment? As Figure 1 shows there are data (tuples) in thedatabase that are not part of the current experiment and ifincluded in the package, may increase the size considerably,not leading to a light-weight virtualized environment.
•How can a self-contained package be created so that Bobdoes not have to install or configure a database server?
•How can Bob re-execute the database application, par-tially, or wholly, without communicating with Alice’s databaseserver?
We describe our primary contributions in addressing thesequestions, and also describe an overall organization map forthe paper. In summary we monitor database applications andcombine database and application provenance to determinenecessary and sufficient data. We describe how this data can be
LDV: Light-weight Database Virtualization
• Goal: Easily and efficiently share and repeat DB applications.
Key Ideas
• DB application = Application (OS) part + DB part
• Use data provenance to capture interactions from/to the application side to the database side
• Limited formal mechanisms so far to combine the two kinds of provenance models
• Create a virtualized package that can be re-executed
• Either include the server and data, or replay interactions (for licensed databases)
• No virtualization mechanism for database replay
Related Work
• Application virtualization
• Linux Containers, CDE[Usenix’11]
• Packaging with annotations
• Docker
• Packaging with provenance
• PTU1[TaPP’13], ReproZip[TaPP’13], Research Objects
• Unified provenance models
• based on program instrumentation [TaPP’12]
1 Q. Pham, T. Malik, and I. Foster. Using provenance for repeatability. In Theory and Practice of Provenance (TaPP), 2013.
How does LDV work?
Application
Operating System
File System
DB Server
Execution Trace
DB Server DB Slice
File System Slice
Pkg
Copy LDV
Alice'sComputer
Alice’s Machine
ldv-audit db-app
• Monitoring system calls
• Monitoring SQL
• Server-included packages
• Server-excluded packages
• Execution traces
• Relevant DB and filesystem slices
• Redirecting file access
• Redirecting DB access
• Server-included packages
• Server-excluded packagesFile System
Bob'sComputerUser Application
Operating System
DB Server
Execution Trace
DB Server DB Slice
File System Slice
Pkg
LDV Redirect
Bob’s Machine
ldv-exec db-app
How does LDV work?
Example
Alice:~$ ldv-audit app.shApplication package created as app-pkgAlice:~$ lsapp-pkg app.sh src dataAlice:~$echo "Hi Bob, Please find the pkg --Alice" \ | mutt -s "Sharing DB Application -a "./app-pkg" \ -- [email protected]
Bob:~$ ls .app-pkgBob:~$ cd app-pkgBob:~$ lsapp.sh src dataBob:~$ldv-exec app.shRunning app-pkg....
Ubuntu 14.04(Kernel 3.13)
+Postgres 9.1
CentOS 6.2(Kernel 2.6.32)
+MySQL
LDV Issues
• Monitoring system calls
• Monitoring SQL
• Execution traces
• Relevant DB slices
• Redirecting file access
• Server-included packages
• Server-excluded packages
• Redirecting DB access
Size Comparison
10
100
1000
10000
1-1 1-2 1-3 1-4 1-5 2-1 2-2 2-3 2-4 3-1 3-2 3-3 3-4 4-1 4-2 4-3 4-4 4-5
Pack
age
size
(MB)
Query
PTU packageServer-included packageServer-excluded package
Fig. 9: LDV packages are significantly smaller than PTUpackages when queries have low selectivity.
and the LDV packages. The VMI is 8.2 GB: 80 times largerthan the average LDV package (100MB). To evaluate runtimeperformance, we instantiate this VMI using the same numberof cores and memory as in our machine to execute our queries.Recall that Figure 8b shows that re-executing these queriesin a VM is slightly slower than a non-audited PostgreSQLexecution, and significantly slower than LDV packages.
X. CONCLUSIONS
We introduced a light-weight DB virtualization (LDV)system that can permit sharing and re-execution of appli-cations that perform DB operations. This system uses datacollected via application monitoring to create re-executablepackages that include an application, its dependencies (datafiles, relevant DB tuples), and a combined execution trace.Such packages can be used to repeat an application or part ofan application in a different environment.
Our LDV framework features an innovative integration ofdistinct OS and DB provenance models, and new methodsfor inferring data dependencies that cross model boundaries.The resulting system creates execution traces according to thisframework and uses these traces to determine which data needsto be included in a repeatability package. It leaves to the userthe choice of whether the package should include the DBMS.Our prototype implementation integrates the PTU (OS) andPerm (DB) provenance systems. In future work, we plan tointegrate with the DB-independent GProM [2] middleware.
XI. ACKNOWLEDGEMENTS
This material is based upon work supported by the NationalScience Foundation under Grants ICER-1343816 and SES-0951576,and by the US Department of Energy under contract DE-AC02-06CH11357. Any opinions, findings, and conclusions or recommen-dations expressed in this material are those of the authors and do notnecessarily reflect the views of the National Science Foundation.
REFERENCES
[1] Y. Amsterdamer et al. Putting Lipstick on Pig: Enabling Database-styleWorkflow Provenance. PVLDB, 5(4), 2011.
[2] B. Arab, B. Glavic, et al. A generic provenance middleware for databasequeries, updates, and transactions. In Proceedings of TaPP, 2014.
[3] N. Balakrishnan, T. Bytheway, et al. Opus: A lightweight system forobservational provenance in user space. In TaPP, 2013.
[4] C. T. Brown. Some myths of reproducible computational research. http://ivory.idyll.org/blog/2014-myths-of-computational-reproducibility.html.
[5] J. Cheney et al. Provenance in Databases: Why, How, and Where.Foundations and Trends in Databases, 1(4), 2009.
[6] F. Chirigati and J. Freire. Towards integrating workflow and databaseprovenance. In Provenance and Annotation of Data and Processes. 2012.
[7] F. S. Chirigati, D. Shasha, and J. Freire. Reprozip: Using provenanceto support computational reproducibility. In TaPP, 2013.
[8] S. C. Dey, S. Riddle, and B. Ludascher. Provenance analyzer: Exploringprovenance semantics with logic rules. In TaPP, 2013.
[9] J. Freire and C. T. Silva. Making computations and publicationsreproducible with vistrails. Computing in Science and Engineering,14(4), 2012.
[10] B. Glavic et al. Perm: Processing Provenance and Data on the sameData Model through Query Rewriting. In ICDE, 2009.
[11] B. Glavic et al. Using sql for efficient generation and querying ofprovenance information. In In search of elegance in the theory andpractice of computation: a Festschrift in honour of Peter Buneman. 2013.
[12] C. A. Goble and D. C. De Roure. myExperiment: social networkingfor workflow-using e-scientists. In Proceedings of the 2Nd Workshopon Workflows in Support of Large-scale Science, 2007.
[13] P. J. Guo et al. CDE: using system call interposition to automaticallycreate portable software packages. In USENIX Annual TechnicalConference, Portland, OR, 2011.
[14] B. Howe. Virtual appliances, cloud computing, and reproducibleresearch. Computing in Science & Engineering, 14(4):36–41, 2012.
[15] G. Karvounarakis and T. Green. Semiring-annotated data: Queries andprovenance. SIGMOD Record, 41(3):5–14, 2012.
[16] K. Keahey et al. Virtual workspaces for scientific applications. Journalof Physics: Conference Series, 78(1), 2007.
[17] N. Kwasnikowska, L. Moreau, and J. Van den Bussche. A formalaccount of the open provenance model. journal, 2010.
[18] S. Lampoudi. The path to virtual machine images as first classprovenance. Age, 2011.
[19] T. Malik, L. Nistor, and A. Gehani. Tracking and sketching distributeddata provenance. In International Conference on eScience, 2010.
[20] L. Moreau and P. Missier. Prov-dm: The prov data model.http://www.w3.org/TR/2013/REC-prov-dm-20130430/, 2013.
[21] K.-K. Muniswamy-Reddy, U. Braun, D. A. Holland, P. Macko,D. Maclean, D. W. Margo, M. I. Seltzer, and R. Smogor. Layeringin provenance systems., 2009.
[22] Q. Pham, T. Malik, and I. Foster. Using provenance for repeatability.In TaPP, 2013.
[23] Q. Pham, T. Malik, and I. Foster. Auditing and maintaining provenancein software packages. In IPAW, 2014.
[24] M. Stamatogiannakis et al. Looking inside the black-box: Capturingdata provenance using dynamic instrumentation. In TAPP, 2014.
[25] Transaction Processing Performance Council. TPC-H benchmark spec-ification. Published at http://www.tcp.org/hspec.html, 2008.
[26] M. Zhang et al. Tracing Lineage beyond Relational Operators. In VLDB,2007.
• LDV packages are significantly smaller than PTU packages when queries have low selectivity
• The VMI is 8.2 GB: 80 times larger than the average LDV package (100MB).
Audit and Replay
TABLE II: The 18 TPC-H benchmark queries used in our experiments
Queries SQL PARAM Sel.Q1-1 toQ1-5
SELECT l quantity, l partkey , l extendedprice , l shipdate , l receiptdate FROM lineitemWHERE l suppkey BETWEEN 1 AND PARAM
10, 20, 50, 100,250
1%, 2%, 5%,10%, 25%
Q2-1 toQ2-4
SELECT o comment, l comment FROM lineitem l, orders o, customer c WHERE l.l orderkey= o.o orderkey AND o.o custkey = c.c custkey AND c.c name LIKE ’%PARAM%’;
0000000, 000000,00000, 0000
66%, 6.6%,0.66%, 0.06%
Q3-1 toQ3-4
SELECT count(⇤) FROM lineitem l, orders o, customer c WHERE l.l orderkey = o.o orderkeyAND o.o custkey = c.c custkey AND c.c name LIKE ’%PARAM%’;
0000000, 000000,00000, 0000
66%, 6.6%,0.66%, 0.06%
Q4-1 toQ4-5
SELECT o orderkey, AVG(l quantity) AS avgQ FROM lineitem l, orders o WHERE l.l orderkey= o.o orderkey AND l suppkey BETWEEN 1 AND PARAM GROUP BY o orderkey;
10, 20, 50, 100,250
1%, 2%, 5%,10%, 25%
(a) Audit
1e-05
0.0001
0.001
0.01
0.1
1
10
100
1000
10000
Inserts FirstSelect
OtherSelects
Updates
Exec
utio
n tim
e (s
econ
ds)
PostgreSQL + PTUServer-included packageServer-excluded package
(b) Replay
1e-05
0.0001
0.001
0.01
0.1
1
10
100
1000
10000
Initialization Inserts FirstSelect
OtherSelects
Updates
Exec
utio
n tim
e (s
econ
ds)
PostgreSQL + PTU
1E-0
5
0.01
001
0.08
0.05
3
0.00
01
Server-included package
4.19
0.00
063
0.05
0.02
5
0.00
03
Server-excluded package
0.01
2E-0
5
0.01
0.00
9
0.00
01
Fig. 7: Execution time of each step in an application with query Q1-1
(a) Audit
0.001
0.01
0.1
1
10
100
1000
10000
1-1 1-2 1-3 1-4 1-5 2-1 2-2 2-3 2-4 3-1 3-2 3-3 3-4 4-1 4-2 4-3 4-4 4-5
Exec
utio
n tim
e (s
econ
ds)
Query
PostgreSQL + PTUServer-included packageServer-excluded package
(b) Replay
0.001
0.01
0.1
1
10
100
1-1 1-2 1-3 1-4 1-5 2-1 2-2 2-3 2-4 3-1 3-2 3-3 3-4 4-1 4-2 4-3 4-4 4-5
Exec
utio
n tim
e (s
econ
ds)
Query
PostgreSQL + PTUServer-included packageServer-excluded package
VM
Fig. 8: Execution time for each query, during audit (left) and replay (right)
TABLE III: Package Contents: PTU packages contain all datafiles of the full DB, whereas server-included LDV packagescontain the data files of an empty DB.
Package type Softwarebinaries
DBserver
Datafiles
DBprovenance
PTU 3 3 3(full) 7LDV server-included 3 3 3(empty) 3LDV server-excluded 3 7 7 3
server in the application execution, i.e., the server binaries anddata files. An LDV package contains DB provenance for re-execution, the DB server binaries, and an empty data directoryin the server-included scenario (Table III).
Figure 9 shows the sizes of the PTU, server-included, andserver-excluded packages constructed for the queries listedin Table II. Server-included LDV packages are significantlysmaller than PTU packages, because they contain only those
tuples needed to re-execute the application—which, for thesequeries, is at most ⇠25% of all tuples. Server-excluded LDVpackages are often yet smaller, because they contain only thequery results—which, for many of our experiment queries, aresmaller than the tuples required for re-execution. However,recall that server-excluded packages have less flexibility thando server-included packages.
F. Comparison with the Virtual Machine Approach
We compare a virtual machine image (VMI) with the server-included and server-excluded LDV approaches. The VMI iscreated based on a bare-bone Debian Wheezy 64bit VMI onwhich we install the DB server and the experiment binaries asin Section IX-A. We use “apt-get” to install a DB server, and“scp” to copy all DB files and source code for the experimentfrom our machine. Using the created VMI, we run the sameapplication to compare the size and performance of this VMI
LDV amortizes audit cost significantly at replay time
Summary• LDV permits sharing and repeating DB
applications
• LDV combines OS and DB provenance to determine file and DB slices
• LDV creates light-weight virtualized packages based on combined provenance
• Results show LDV is efficient, usable, and general
• LDV at http://github.com/lordpretzel/ldv.git