MataNui - Building a Grid Data Infrastructure that "doesn't suck!"

Poster presented at the New Zealand eResearch Symposium 2010 in Auckland, New Zealand

G. K. Kloss and M. J. Johnson
{G.Kloss | M.J.Johnson}@massey.ac.nz
Institute of Information and Mathematical Sciences, Massey University, Auckland

Introduction

MOA (Microlensing Observations in Astrophysics) [1] is a Japan/New Zealand collaboration project. It observes dark matter, extra-solar planets and stellar atmospheres using gravitational microlensing, one of the few techniques able to detect low-mass extra-solar planets. The technique works by analysing large quantities of imagery from optical telescopes.

The Problem

Astronomers world-wide are producing telescopic images. Teams across New Zealand and Japan access these for their research, creating higher-level data products. All these data files need to be stored and retrieved. Currently they are either stored on remotely accessible servers, or they are transferred via removable offline media. Additionally, every stored item is annotated with a potentially extensive set of meta-data. Researchers commonly keep essential parts of this meta-data separately on their own systems, so that they can identify particular items to retrieve for their work.

This process is tedious and cumbersome, especially as not all data files are available online; some require access to various forms of offline media. Files that are available online have to be retrieved from remote systems through potentially slow connections. A direct and homogeneous access pattern for all data files and their associated meta-data does not exist.

Data management is not a new topic, and many solutions for it are available already. Many of them are hand-knit, and many are commercial and potentially very expensive. Even more importantly, they usually do not work well together with current Grid infrastructures, as they were not designed to be “Grid ready.” They are often complicated and require altering the research workflow to suit the needs of the system. Lastly, the ones meeting most of the requirements commonly do not provide graphical end-user tools to support data-intensive research.

Requirements

The envisioned Grid-enabled data management system has to meet a few requirements. Most of all, it should be implementable without having to “re-invent the wheel”: it should be possible to source large portions of its essential components from existing (free) tools, requiring “just” some “plumbing” to join them so that they meet these requirements:

• Handle large amounts of data
• Handle arbitrary amounts of meta-data
• Manage storage/access from remote locations
• Offer local access/storage through replication
• Perform (server-side) queries on the meta-data
• Be robust, easy to deploy and easy to use
• Perform well on larger data collections
• Use Grid Computing standards/practices

Abstract

In science and engineering, being troubled by data management is a quite common problem, particularly if partners within a project are geographically distributed and require fast access to data. These partners would ideally like to access or store data on local servers only, while still retaining access for remote partners without manual intervention. This project attempts to solve such data management problems in an international research collaboration (in astrophysics) with the participation of several New Zealand universities. Data is to be accessed and managed along with its meta-data in several distributed locations, and it has to integrate with the infrastructure provided by the BeSTGRID project. Researchers also need to be able to use a simple but powerful graphical user interface for data management. This poster outlines the requirements, implementation and tools involved for such a Grid data infrastructure.

Keywords: Grid Computing; Data Fabric; distributed; meta-data; data replication; GUI client; DataFinder.

Front Ends

GridFTP is the most common way to integrate data services into a Grid environment, and is widely used for scripts and automation. It is the lowest common denominator for compatibility with the Grid, and it features (among others) Grid certificate based authentication and third-party transfers.
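
Such transfers are easy to automate. As an illustration (not part of the poster's implementation), the following sketch drives a third-party transfer between two hypothetical GridFTP endpoints using the standard Globus Toolkit client globus-url-copy, assuming a valid proxy credential exists:

    # Sketch: scripted third-party GridFTP transfer between two
    # hypothetical endpoints. Requires the Globus Toolkit client
    # globus-url-copy and a valid proxy (e.g. from grid-proxy-init).
    import subprocess

    src = "gsiftp://gridftp.site-a.example/moa/obs-2010-08-17.fits"
    dst = "gsiftp://gridftp.site-b.example/moa/obs-2010-08-17.fits"

    # With two gsiftp URLs the data flows directly between the
    # servers; the client only coordinates the control channels.
    subprocess.check_call(["globus-url-copy", src, dst])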

File System Mounts are a common way to integrate externally stored file systems directly into the host system of compute resources (e.g. compute clusters, high-performance computing servers). This enables scripts and applications to use the data simply and directly, without an additional retrieval or upload step.

Figure 1: DataFinder concept: data modelling and storage.

The DataFinder GUI Client [2] is an application researchers can use as an easy-to-use end-user tool supporting their data-intensive needs (Fig. 2 and 4). It has been developed as open source software by the German Aerospace Centre to support internal projects and external partners. The application allows easy and flexible access to remote data repositories with associated meta-data. The DataFinder is designed for scientific and engineering purposes, and it assists through the following:

• Handles access/transfer to/from data server(s)
• Retrieval and modification of meta-data
• Extensive (server-side) queries on all meta-data
• Support for project-specific policies:
  – Data hierarchy definition
  – Enforcement of workflows
  – Meta-data specification
• Scripting to automate recurring tasks
• Can integrate 3rd party (GUI) tools

The DataFinder can act as a universal Grid/storage system client [3] (Fig. 1), as it is easily extensible to connect to further storage sub-systems (beyond those already available).

Figure 2: Integration of GUI applications with DataFinder.

Implementation

See Fig. 3.

Storage back-end (GridFS on MongoDB) – For a straightforward implementation, a suitable data storage server was sought. We chose the “NoSQL” database MongoDB [4]. It features the “GridFS” storage mode, capable of storing file-like data (in large numbers and sizes) along with its meta-data. MongoDB can work in federation with distributed servers, with data automatically replicated to the other instances. Therefore, every site can operate its own local MongoDB server, keeping data access latencies low and performance high.
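
As a minimal sketch of this back-end (using the pymongo/gridfs packages; host, database and field names are hypothetical), a file and its meta-data are stored in a single step:

    # Sketch: storing a telescope image plus meta-data in GridFS.
    # Host, database and meta-data field names are hypothetical.
    from pymongo import MongoClient
    import gridfs

    db = MongoClient("mongodb://localhost:27017")["matanui"]
    fs = gridfs.GridFS(db)

    with open("obs-2010-08-17.fits", "rb") as f:
        # Extra keyword arguments become fields of the fs.files
        # document, i.e. server-side queryable meta-data.
        file_id = fs.put(f, filename="obs-2010-08-17.fits",
                         telescope="MOA-II", band="R")

Pointing the client at a replica set instead of a single host (e.g. a mongodb:// URI listing several members) gives the replicated, multi-site setup described above.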

Native file system mount (GridFS FUSE) – A GridFS FUSE driver [5] is already available, so a remote GridFS can be mounted into a local Linux system.
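
The mount itself is then a single command. The exact binary name and options depend on how the driver [5] is built, so the following wrapper merely illustrates an assumed invocation:

    # Sketch: mounting a GridFS database via FUSE. Binary name,
    # flags and paths are assumptions; check the gridfs-fuse
    # documentation for the actual invocation.
    import subprocess

    subprocess.check_call([
        "./mount_gridfs",      # hypothetical driver binary
        "--db=matanui",        # database holding the GridFS
        "--host=localhost",    # local (replicated) MongoDB
        "/mnt/gridfs",         # mount point on the host
    ])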

Grid front-end (GridFTP) – To provide access through standard Grid means, the Griffin GridFTP server [6] by the Australian Research Collaboration Service (ARCS) is equipped with a GridFS storage back-end. Through this, every Grid-capable tool can be used to store/retrieve files with any of the MongoDB instances interfaced by a Griffin server. This access method also allows Grid applications to access the storage server using the commonly used Grid certificates.

Figure 3: Overview of the Grid data infrastructure.

Figure 4: Turbine simulation workflow with DataFinder (with custom GUI dialogues).

GUI front-end (DataFinder) – The DataFinder is to be interfaced with the GridFS storage back-end. To avoid giving a remote end-user client full access to the MongoDB server, a server interface layer is introduced. For this, a RESTful web service authenticating against a Grid certificate is implemented. The implementation is based on the Apache web server through the WSGI interface layer [7]; a minimal sketch of such a service is given below. On the client side, the DataFinder is equipped with a storage back-end accessing this web service. The DataFinder is currently the only client fully capable of making use of the available meta-data (creating, modifying and accessing meta-data, as well as performing efficient server-side queries on it). Server-side queries in particular reduce data access latencies significantly and improve query performance.

WebDAV front-end (Catacomb) – A potential future pathway to access GridFS content is the Catacomb WebDAV server [8]. It can be modified to use GridFS/MongoDB as a storage back-end instead of the currently used MySQL relational database.
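
The following is a minimal sketch of such a RESTful WSGI layer, assuming the Grid certificate check is delegated to the Apache front-end (e.g. mod_ssl); the URL layout and names are hypothetical:

    # Sketch: minimal WSGI service exposing GridFS over REST.
    # Assumes Apache has already verified the client's Grid
    # certificate; database name and URL layout are hypothetical.
    from pymongo import MongoClient
    import gridfs

    db = MongoClient("mongodb://localhost:27017")["matanui"]
    fs = gridfs.GridFS(db)

    def application(environ, start_response):
        # URLs map directly to GridFS file names: /<filename>
        name = environ["PATH_INFO"].lstrip("/")
        method = environ["REQUEST_METHOD"]
        if method == "GET":
            if not fs.exists(filename=name):
                start_response("404 Not Found", [])
                return [b""]
            data = fs.get_last_version(filename=name).read()
            start_response("200 OK",
                           [("Content-Type", "application/octet-stream")])
            return [data]
        if method == "PUT":
            length = int(environ.get("CONTENT_LENGTH") or 0)
            fs.put(environ["wsgi.input"].read(length), filename=name)
            start_response("201 Created", [])
            return [b""]
        start_response("405 Method Not Allowed", [])
        return [b""]

Such an application can be served by Apache via mod_wsgi (cf. the WSGI discussion in [7]) or by any other WSGI container.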

Results

By choosing suitable existing building blocks, it becomes comparatively simple to implement a consistent Grid data infrastructure with the desired features. The implementation is currently making good progress, and is expected to be simple to deploy and configure, as well as to integrate seamlessly into the infrastructures of BeSTGRID or other projects. Particularly the problem of operating on large amounts of annotated data from astrophysics research stands to benefit significantly from this work. Data can be stored and accessed by geographically remote partners equally fast, and processing can be performed on local data. Data processing can easily be conducted on sets returned as the results of queries (e.g. for particular spatial regions, for specific phenomena indicated in the meta-data, for data produced by certain telescopes, or within given time frames).
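
As an illustration (continuing the hypothetical field names used in the storage sketch above), such a working set can be selected entirely server-side before any file content is transferred:

    # Sketch: server-side meta-data query on the fs.files
    # collection, followed by retrieval of the matching images.
    # Field names (telescope, obs_date) are hypothetical.
    from datetime import datetime
    from pymongo import MongoClient
    import gridfs

    db = MongoClient("mongodb://localhost:27017")["matanui"]
    fs = gridfs.GridFS(db)

    query = {
        "telescope": "MOA-II",
        "obs_date": {"$gte": datetime(2010, 8, 1),
                     "$lt": datetime(2010, 9, 1)},
    }
    for doc in db.fs.files.find(query):
        image = fs.get(doc["_id"]).read()  # fetch matches only
        # ... process the image locally ...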

References

[1] I. A. Bond, F. Abe, R. Dodd, et al., “Real-time difference imaging analysis of MOA Galactic bulge observations during 2000,” Monthly Notices of the Royal Astronomical Society, vol. 327, pp. 868–880, 2001.

[2] T. Schlauch and A. Schreiber, “DataFinder – A Scientific Data Management Solution,” in Proceedings of Symposium for Ensuring Long-Term Preservation and Adding Value to Scientific and Technical Data 2007 (PV 2007), Munich, Germany, October 2007.

[3] T. Schlauch, A. Eifer, T. Soddemann, and A. Schreiber, “A Data Management System for UNICORE 6,” in Proceedings of EuroPar Workshops – UNICORE Summit, ser. Lecture Notes in Computer Science (LNCS). Delft, Netherlands: Springer, August 2009.

[4] “MongoDB Project,” http://www.mongodb.org/.

[5] M. Stephens, “GridFS FUSE Project,” http://github.com/mikejs/gridfs-fuse.

[6] S. Zhang, P. Coddington, and A. Wendelborn, “Connecting arbitrary data resources to the Grid,” in Proceedings of the 11th International Conference on Grid Computing (Grid 2010). Brussels, Belgium: ACM/IEEE, October 2010.

[7] N. Piël, “Benchmark of Python WSGI Servers,” http://nichol.as/benchmark-of-python-web-servers, March 2010.

[8] M. Litz, “Catacomb WebDAV Server,” in UpTimes – German Unix User Group (GUUG) Members’ Magazine, April 2006, pp. 16–19.