Dissertation defense

EUROPEAN UNION Data source registration in the Virtual Laboratory Marek Pomocka major: applied computer science specialisation: computer techniques in science and technology Faculty of Physics and Applied Computer Science, AGH University of Science and Technology Supervisor: Marian Bubak, Ph.D. Consultants: Piotr Nowakowski, M.Sc. Daniel Harężlak, M.Sc. Master’s thesis defense November 13, 2009

description

Dissertation title and final project: Data source registration in the Virtual Laboratory. The subject of the thesis and related project was to integrate EGEE/WLCG data sources into GridSpace Virtual Laboratory (http://gs.cyfronet.pl/). Poster presentation entitled Integrating EGEE Storage Services with the Virtual Laboratory: http://www.plgrid.pl/en/pr_materials/posters Dissertation available at http://virolab.cyfronet.pl/trac/vlvl#MasterofScienceThesesrelatedtoViroLab

Transcript of Dissertation defense

Page 1: Dissertation defense

EUROPEAN UNION

Data source registration in the Virtual Laboratory

Marek Pomocka
major: applied computer science
specialisation: computer techniques in science and technology
Faculty of Physics and Applied Computer Science, AGH University of Science and Technology

Supervisor: Marian Bubak, Ph.D.
Consultants: Piotr Nowakowski, M.Sc., Daniel Harężlak, M.Sc.

Master’s thesis defense
November 13, 2009

Page 2: Dissertation defense

Outline

Introduction to Grid technologies and Virtual Laboratories
Motivation and Objectives
Conceptual view onto the solution
Challenges and solutions
Applications
Future work
Summary
References

Page 3: Dissertation defense

GRID TECHNOLOGIES AND VIRTUAL LABORATORIES


Page 4: Dissertation defense

Grid is a distributed computing architecture with cross-organizational access, providing nontrivial quality of service for participating actors.

Page 5: Dissertation defense

Notable applications include
  high-energy physics (LHC)
  complex parameter studies in biomedicine and biochemistry
  weather forecasting
  natural disaster modelling
  digital image archives

Page 6: Dissertation defense

Grid is a computer infrastructure dedicated to conducting in-silico research,

[Figure labels: TASK, PCSS, ICM, CYFRONET, WCSS]

created by many partners who share supercomputers, computer clusters, storage and research instruments

Page 7: Dissertation defense

to create common space for e-Science

Page 8: Dissertation defense

Grid users are Virtual Organizations (VOs), which are dynamic by their nature.

[Figure labels: CYFRONET, PSNC]

The VO approach simplifies access management.

Page 9: Dissertation defense

Examples of Grids

TeraGrid

Open Science Grid

EGEE, DEISA

Page 10: Dissertation defense

Virtual Laboratories (VLs) supply higher-level services and abstract low-level details related to Grid service invocations, security etc. away from end-users.

[Figure layers: Virtual Laboratory, Grid middleware, Grid infrastructure]

Many VLs endeavor to be general-purpose in-silico (or virtual) experiment design and execution environments, e.g. GridSpace Virtual Laboratory.

Page 11: Dissertation defense

Others are often designed for a specific purpose, such as
  remote access to scientific instruments (e.g. VLAB)
  supporting research in meteorology (LEAD)
  research and decision support in virology (ViroLab)

Page 12: Dissertation defense

Virtual experiments in VLs are expressed using script-based languages (e.g. in GridSpace, Athena, Geodise)

  if (condition) then … else … end

… or using workflow languages (e.g. in VL-e, VLAB, myExperiment, myGrid Taverna, Kepler, Triana, Pegasus)

VLs made Grids available to non-computer scientists.

[Figure labels: Grid Users, Virtual Laboratory]

Page 13: Dissertation defense

MOTIVATION AND OBJECTIVES


Page 14: Dissertation defense

Hello, I’m a chemist. I use the Gaussian program and work mostly with files. I’d like to use Grids, but the filesystem is far too complex for me.

... the security system is complicated too.

Yes, I do agree. We won’t use Grids until there is an easy way of using Grid file catalogues from virtual experiments.

Page 15: Dissertation defense

Objectives

The objective of the dissertation is to meet these needs by enabling access to LFC data sources from GridSpace scripts, concealing most of the interactions with Grid Security Infrastructure (GSI).

This goal entails several other objectives:
  Data Source Registry reorganization
  extending DSR EPE plug-in
  Integration with GridSpace Engine

[Figure labels: GSEngine, DAC2, LFC DS]

Page 16: Dissertation defense

Conceptual view onto the solution

Page 17: Dissertation defense

CHALLENGES AND SOLUTIONS


Page 18: Dissertation defense

Not to compromise GSEngine portability

Platform independent (Linux, UNIX, Windows, Mac OS X): GSEngine, GScript, LFC integration, LFC connector, LFC client library

Platform dependent (Scientific Linux 4 (SL4)): LFC DS Server

Solution: isolation of platform-dependent code into a remote service

Page 19: Dissertation defense

Serve multiple users utilizing inherently single-user gLite libraries.

Solution: ChemPo command wrappers, where each command is run in a new JVM with a prepared UNIX environment.

Instead of a permanent place for credentials (e.g. ~/.globus/), use temporary files and specify their paths dynamically in the UNIX environment of the created JVM processes.

[Figure labels: LFC DS Server (Server JVM); Worker 1 JVM with Cert1 and Key1; Worker 2 JVM with Cert2 and Key2]
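The per-user isolation described above can be sketched in Ruby as follows. This is only an illustration of the idea, not the LFC DS Server code: the helper name `run_with_credentials` is hypothetical, a generic child process stands in for the worker JVM, and the sketch assumes the conventional Globus variables `X509_USER_CERT` and `X509_USER_KEY` are how credential paths are handed to the child.

```ruby
require "tempfile"
require "open3"
require "rbconfig"

# Hypothetical helper (not part of the thesis code): runs `command` in a
# fresh child process whose environment points at temporary credential
# files, so concurrent users never share a fixed location like ~/.globus/.
def run_with_credentials(cert_pem, key_pem, command)
  Tempfile.create("cert") do |cert|
    Tempfile.create("key") do |key|
      cert.write(cert_pem)
      cert.flush
      key.write(key_pem)
      key.flush
      # Assumption: the child locates credentials through these variables.
      env = { "X509_USER_CERT" => cert.path, "X509_USER_KEY" => key.path }
      output, status = Open3.capture2e(env, *command)
      raise "command failed: #{output}" unless status.success?
      output
    end
  end
  # Both temporary files are removed once the blocks finish.
end

# Usage: the child only echoes the certificate it was given.
reader = [RbConfig.ruby, "-e", "print File.read(ENV.fetch('X509_USER_CERT'))"]
puts run_with_credentials("CERT-OF-USER-1", "KEY-OF-USER-1", reader)
```

Because each call builds its own environment hash and its own temporary files, two users served at the same time do not see each other's credentials.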

Page 20: Dissertation defense

Enabling access to Grid files without downloading them to the GSEngine machine

Grid File Access Library (GFAL)

ChemPo command wrappers do not support such a mode of operation (streaming to client).

First, download the file to the LFC DS Server. Then, stream it to the client. Vice versa for sending a file to the Grid, i.e. stream the file to the LFC DS Server, then send it to the Grid.

Page 21: Dissertation defense

Streaming representation in GridSpace scripts

Solution: User receives a modified version of the Ruby IO object (sending a file to the Grid happens on the file close operation, while retrieving a file from the Grid happens during object initialization)

Reading a Grid file

    ds.open("mpomocka/test_file", "r") do |file|
      file.each {|line| puts line}
    end

    f = ds.open("mpomocka/test_file", :r)
    f.each {|line| puts line}
    f.close

Writing to a Grid file

    f = ds.open("mpomocka/test_file", :write)
    f.puts "First line of the file test_file"
    f.puts "Second line of the file test_file"
    f.close

Alternatively

    ds.open("mpomocka/test_file", :w) do |f|
      f.puts "Another way to write to a file"
      f.puts "Note that close is not necessary"
    end
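One way such an IO object could behave is sketched below, purely as an illustration. `FakeGridStore`, `GridFile` and `grid_open` are hypothetical names, and an in-memory Hash stands in for the LFC DS Server; the actual DAC2 implementation may differ. The sketch shows the two properties stated above: contents are fetched when the object is created, and written data is sent when the object is closed.

```ruby
require "stringio"

# Hypothetical stand-in for the remote server: maps paths to contents.
class FakeGridStore
  def initialize
    @files = {}
  end

  def fetch(path)
    @files.fetch(path)
  end

  def store(path, data)
    @files[path] = data
  end
end

# Hypothetical IO-like object: download on initialization, upload on close.
class GridFile < StringIO
  def initialize(store, path, mode)
    @store = store
    @path = path
    @writing = %w[w write].include?(mode.to_s)
    # Reading: retrieve the contents now. Writing: start with an empty buffer.
    super(@writing ? +"" : store.fetch(path).dup)
  end

  def close
    # The buffered data is sent only once, when the file is first closed.
    @store.store(@path, string) if @writing && !closed?
    super
  end
end

# Mirrors File.open: with a block, the file is closed automatically.
def grid_open(store, path, mode)
  file = GridFile.new(store, path, mode)
  return file unless block_given?
  begin
    yield file
  ensure
    file.close
  end
end

store = FakeGridStore.new
grid_open(store, "mpomocka/test_file", :w) { |f| f.puts "First line" }
grid_open(store, "mpomocka/test_file", :r) { |f| f.each { |line| puts line } }
```

Deferring the upload to `close` keeps the script-side interface identical to ordinary Ruby file handling, at the cost of buffering the whole file before it is sent.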

Page 22: Dissertation defense

Need for a descriptive and intuitive API
  mimicking Ruby file operations, e.g. exist?, file?
  e.g. create_directory instead of mkdir

DAC2 LFC DS methods

Method name                       Aliases
createDirectory(parent, child)    create_directory
createDirectory(path)             create_directory
delete(path)                      delete_file, deleteFile
deleteFile(filename)
directory?(filename)              isDirectory, is_directory
exist?(path)                      exist, exists, exist?
file?(path)                       isFile, is_file
getFile(filename)                 get_file
getSize(path)                     size, size?, get_size
listFiles(path)                   list_files
openFile(path, mode, &b)          open, open_file
storeFile(payload, filename)      store_file
zero?(path)
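The alias scheme can be illustrated with Ruby's `alias_method`. The class below is a hypothetical in-memory sketch, not the DAC2 source: `LfcDataSourceSketch` and its Hash-based catalogue are assumptions made for the example, and only a few of the listed methods are shown.

```ruby
# Hypothetical sketch: an in-memory catalogue whose query methods follow
# Ruby's File naming and carry some of the aliases from the method table.
class LfcDataSourceSketch
  def initialize
    # path => :directory, or a String holding file contents
    @entries = { "/" => :directory }
  end

  def createDirectory(path)
    @entries[path] = :directory
  end
  alias_method :create_directory, :createDirectory

  def storeFile(payload, filename)
    @entries[filename] = payload
  end
  alias_method :store_file, :storeFile

  def exist?(path)
    @entries.key?(path)
  end
  alias_method :exist, :exist?
  alias_method :exists, :exist?

  def directory?(path)
    @entries[path] == :directory
  end
  alias_method :isDirectory, :directory?
  alias_method :is_directory, :directory?

  def file?(path)
    exist?(path) && !directory?(path)
  end
  alias_method :isFile, :file?
  alias_method :is_file, :file?

  def getSize(path)
    file?(path) ? @entries[path].bytesize : 0
  end
  alias_method :size, :getSize
  alias_method :get_size, :getSize
end

ds = LfcDataSourceSketch.new
ds.create_directory("/grid/mpomocka")
ds.store_file("some payload", "/grid/mpomocka/test_file")
puts ds.file?("/grid/mpomocka/test_file")
puts ds.is_directory("/grid/mpomocka")
```

Offering camelCase, snake_case and predicate-style names for the same operation lets script authors use whichever convention they already know.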

Page 23: Dissertation defense

Security

Secure communication
  Transport Layer Security: need to manage keystores
  Tunnelling is simpler

Credentials management
  Proxy certificate generation: Java CoG Kit
  Credentials are stored in DSR (Data Source Registry)
  Credentials can be set static, i.e. shared with other authenticated users

Page 24: Dissertation defense

Proxy generated automatically during initialization

Page 25: Dissertation defense

Information needs – the previous DSR structure did not enable storage of LFC data source information or gLite credentials.

Solution:

[Figure labels: DataSources, RelationalDataSources, DataSources, LFCDataSources, LFCCertData, LFCDSConnections]

Also changes to DAC2 and DSR EPE Plug-in DSR access modules.

Page 26: Dissertation defense

GUI for registering data source of new type

Created as a new form in EPE DSR Plug-in

In addition, some new DSR access methods were created in DSR EPE Plug-in.

Page 27: Dissertation defense

Selection of distributed computing approach

Technology | Communication overhead | Development cost | Operation when endpoints are protected by firewall | Unnecessary features
Java RMI | Low | Low | Difficult | Few
SOAP | High | Moderate | Uncomplicated | Few
Heavy-weight distributed computing frameworks (e.g. CORBA, EJB) | ? | Moderate or high | ? | Many
Socket-based communication | Low | Very high | Uncomplicated | Few
Cajo | Low | Low | Uncomplicated | Few

Page 28: Dissertation defense

Exchanging large files – how to avoid OutOfMemory errors?

Solution: employ RMIIO library (RemoteInputStream[Server]

and RemoteOutputStream[Server] classes)
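The principle behind stream-based transfer, moving a file in bounded chunks instead of holding it in memory as one object, can be sketched in Ruby. This is not the RMIIO API; `copy_in_chunks` and the 64 KiB chunk size are assumptions chosen for illustration.

```ruby
require "stringio"

CHUNK_SIZE = 64 * 1024 # bytes held in memory at any one time (assumed value)

# Hypothetical helper: copies everything from `source` to `sink`, reading at
# most `chunk_size` bytes per step, and returns the number of bytes copied.
def copy_in_chunks(source, sink, chunk_size = CHUNK_SIZE)
  total = 0
  # IO#read(n) returns nil at end of input, which ends the loop.
  while (chunk = source.read(chunk_size))
    sink.write(chunk)
    total += chunk.bytesize
  end
  total
end

source = StringIO.new("x" * 200_000)
sink = StringIO.new
puts copy_in_chunks(source, sink)
```

Memory use stays proportional to the chunk size rather than to the file size, which is what avoids out-of-memory failures on large transfers.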

Figure illustrates downloading a file to client

Page 29: Dissertation defense

Figure – sending a file from client to server

Additional benefits of using RMIIO:
  Compressed socket-based communication
  Automatic retry

Page 30: Dissertation defense

Solution scales linearly

Figure – download and upload times for files up to 2 GB when tested locally on the ChemPo server

Page 31: Dissertation defense

Applications

PL-Grid: Polish Infrastructure for Information Science Support in the European Research Space.

Chemistry Portal – ChemPo

Page 32: Dissertation defense

Future work

Finer-grained security

Pseudo memory-mapped file API (Pseudo MMAP)

Page 33: Dissertation defense

SUMMARY


Page 34: Dissertation defense

New DAC2 API

LFC DS Server
LFC DS client Java library
DAC2 LFC connector
DAC2 LFC DS methods

Method name                       Aliases
createDirectory(parent, child)    create_directory
createDirectory(path)             create_directory
delete(path)                      delete_file, deleteFile
deleteFile(filename)
directory?(filename)              isDirectory, is_directory
….

Page 35: Dissertation defense

Automated and transparent handling of Grid credentials

Reorganized DSR Schema

Extended EPE DSR Plug-in

Page 36: Dissertation defense

References

[1] M. Pomocka, P. Nowakowski, and M. Bubak, Integrating EGEE Storage Services with the Virtual Laboratory. Poster presented as part of the Cracow Grid Workshop ’09, Krakow, Poland, 12-14 October 2009.

[2] M. Pomocka, P. Nowakowski, and M. Bubak, Integrating EGEE Storage Services with the Virtual Laboratory. In Marian Bubak, Michał Turała, and Kazimierz Wiatr, editors, Proceedings of Cracow Grid Workshop – CGW’09, October 2009, Krakow, Poland. ACC-Cyfronet AGH. To appear.

[3] Lana Abadie et al., Grid-Enabled Standards-based Data Management. In Mass Storage Systems and Technologies, 2007. MSST 2007. 24th IEEE Conference on, pages 60–71, Sept. 2007.

[4] Marian Bubak et al., Virtual Laboratory for Collaborative Applications. In: M. Cannataro (Ed.), Handbook of Research on Computational Grid Technologies for Life Sciences, Biomedicine and Healthcare, Information Science Reference, 2009, IGI Global.

[5] Matthias Assel et al., A Collaborative Environment Allowing Clinical Investigations on Integrated Biomedical Databases. In Tony Solomonides et al. (Ed.): Healthgrid Research, Innovation and Business Case; Proceedings of HealthGrid 2009, Studies in Health Technology and Informatics, vol. 147, IOS Press, ISSN 0926-9630, pp. 51-61.

[6] M. Malawski, T. Bartynski, and M. Bubak, "Invocation of operations from script-based grid applications," Future Generation Computer Systems, vol. In Press, Accepted Manuscript, 2009.
