A Service for Data-Intensive Computations

Transcript of A Service for Data-Intensive Computations

Page 1: A Service for Data-Intensive Computations

A Service for Data-Intensive Computations

on Virtual Clusters

Rainer Schmidt, Christian Sadilek, and Ross King

[email protected]

Intensive 2009, Valencia, Spain

Executing Preservation Strategies at Scale

Page 2: A Service for Data-Intensive Computations

Planets Project

“Permanent Long-term Access through NETworked Services”

Addresses the problem of digital preservation

driven by National Libraries and Archives

Project instrument: FP6 Integrated Project

5th IST Call

Consortium: 16 organisations from 7 countries

Duration: 48 months, June 2006 – May 2010

Budget: 14 million euro

http://www.planets-project.eu/

Page 3: A Service for Data-Intensive Computations

Outline

Digital Preservation

Need for eScience Repositories

The Planets Preservation Environment

Distributed Data and Metadata Integration

Data-Intensive Computations

Grid Service for Job Execution on AWS

Experimental Results

Conclusions

Page 4: A Service for Data-Intensive Computations

Drivers for Digital Preservation

Exponential growth of digital data across all sectors of society

Production of large quantities of often irreplaceable information

Produced by academic and other research

Data becomes significantly more complex and diverse

Digital information is fragile

Ensure long-term access and interpretability

Requires data curation and preservation

Development and assessment of preservation strategies

Need for integrated and largely automated environments

Page 5: A Service for Data-Intensive Computations

Repository Systems and eScience

Digital repositories and libraries are designed for the creation, discovery, and publication of digital collections. These systems are often heterogeneous, closed, and limited in scale.

Data grid middleware focuses on the controlled sharing and management of large data sets in a distributed environment, but is limited in metadata management and preservation functionality.

Integrate repositories into a large-scale environment with distributed data management capabilities

integration of policies, tools, and workflows

integration with computational facilities

Page 6: A Service for Data-Intensive Computations

The Planets Environment

A collaborative research infrastructure for the systematic development and evaluation of preservation strategies.

A decision-support system for the development of preservation plans

An integrated environment for dynamically providing, discovering and accessing a wide range of preservation tools, services, and technical registries.

A workflow environment for data integration, process execution, and the maintenance of preservation information.

Page 7: A Service for Data-Intensive Computations

Service Gateway Architecture

[Architecture diagram; layers from top to bottom: User Applications, Portal Services, Application Execution and Data Services, Physical Resources (Computers, Networks)]

Components shown: Preservation Planning (Plato), Experimentation Testbed Application, Workflow Execution UI, Administration UI, Workflow Execution and Monitoring, Notification and Logging System, Authentication and Authorization, Experiment Data and Metadata Repository, Service and Tool Registry, Application Services, Execution Services, Data Storage Services

Page 8: A Service for Data-Intensive Computations

Data and Metadata Integration

Support distributed and remote storage facilities

data registry and metadata repository

Integrate different data sources and encoding standards. Currently, a range of interfaces/protocols (OAI-PMH, SRU) and encoding standards (DC, METS) are supported; a harvesting sketch follows this list.

Development of common digital object model

Recording of provenance information, preservation, integrity, and descriptive metadata

Dynamic mapping of objects from different digital libraries (TEL, memory institutions, research collections)
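To make one of these interfaces concrete, the following is a minimal harvesting sketch, assuming a hypothetical repository base URL: it issues an OAI-PMH ListRecords request for Dublin Core (oai_dc) records and prints the raw response. A real integration would parse the XML and follow resumption tokens; this illustrates the protocol, not the Planets code.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class OaiPmhHarvest {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; any OAI-PMH-compliant repository works here.
        String baseUrl = "http://example.org/oai";
        // ListRecords with the mandatory oai_dc metadata format.
        URL request = new URL(baseUrl + "?verb=ListRecords&metadataPrefix=oai_dc");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(request.openStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line); // raw OAI-PMH XML response
        }
        in.close();
    }
}
```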

Page 9: A Service for Data-Intensive Computations

Data Catalogue

[Diagram: a Data and Metadata Registry connects a Portal Application and an Application Service to the Experiment Data / Metadata Repository, content repositories holding digital objects (DigObj), format identifiers (PUID), and temporary storage]

Page 10: A Service for Data-Intensive Computations

Page 11: A Service for Data-Intensive Computations

Data-Intensive Applications

Development of the Planets Job Submission Service (JSS)

Allows job submission to a PC cluster (e.g. Hadoop, Condor)

Standard grid protocols/interfaces (SOAP, HPC-BP, JSDL); a submission sketch follows this list

Demand for massive compute and storage resources

Instantiate a cluster on top of (leased) cloud resources based on AWS EC2 + S3 (or, alternatively, in-house in a PC lab)

Allows computation to be moved to the data, or vice versa

On-demand cluster based on virtual machine images

Many preservation tools are, or rely on, third-party applications

Software can be preinstalled on virtual cluster nodes
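As a concrete illustration of these interfaces, here is a sketch of a minimal JSDL job description and its submission to the service. The XML follows the JSDL 1.0 schema; the endpoint URL, the ps2pdf executable, and the plain HTTP POST are illustrative assumptions (HPC-BP actually carries such a document inside a SOAP BES CreateActivity message).

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SubmitJob {
    // Minimal JSDL 1.0 job description: run one command-line tool.
    static final String JSDL =
        "<jsdl:JobDefinition xmlns:jsdl='http://schemas.ggf.org/jsdl/2005/11/jsdl'\n" +
        "    xmlns:posix='http://schemas.ggf.org/jsdl/2005/11/jsdl-posix'>\n" +
        "  <jsdl:JobDescription>\n" +
        "    <jsdl:Application>\n" +
        "      <posix:POSIXApplication>\n" +
        "        <posix:Executable>/usr/bin/ps2pdf</posix:Executable>\n" +
        "        <posix:Argument>input.ps</posix:Argument>\n" +
        "      </posix:POSIXApplication>\n" +
        "    </jsdl:Application>\n" +
        "  </jsdl:JobDescription>\n" +
        "</jsdl:JobDefinition>\n";

    public static void main(String[] args) throws Exception {
        // Hypothetical JSS endpoint; shown as a bare HTTP POST for brevity.
        URL jss = new URL("http://localhost:8080/jss/jobs");
        HttpURLConnection con = (HttpURLConnection) jss.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type", "application/xml");
        OutputStream out = con.getOutputStream();
        out.write(JSDL.getBytes("UTF-8"));
        out.close();
        System.out.println("HTTP status: " + con.getResponseCode());
    }
}
```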

Page 12: A Service for Data-Intensive Computations

Experimental Architecture

[Architecture diagram: a Workflow Environment submits a job description file (JSDL) over HPCBP to the JSS; the JSS and a Data Service run on a virtual node (Xen) inside the virtual cluster and file system (Apache Hadoop) on the cloud infrastructure (EC2); digital objects are staged to and from the storage infrastructure (S3)]

Page 13: A Service for Data-Intensive Computations

A Light-Weight Job Execution Service

[Component diagram: a WS container exposing a BES/HPCBP API, secured via TLS/WS-Security; internal components: AccountManager, JSDLParser, SessionHandler, Exec. Manager, Data Resolver, Input Generator, Job Manager]

Page 14: A Service for Data-Intensive Computations

A Light-Weight Job Execution Service

[The same component diagram as the previous slide, extended with the execution and storage back-ends: Hadoop, HDFS, S3]

Page 15: A Service for Data-Intensive Computations

The MapReduce Application

MapReduce implements a framework and programming model for processing large documents (sorting, searching, indexing) on multiple nodes.

Automated decomposition (split)

Mapping to intermediate key/value pairs (map), with optional local aggregation (combine)

Merge output (reduce)

Provides an implementation approach for data-parallel and I/O-intensive applications

Migrating a digital object (e.g. web page, folder, archive):

Decompose into atomic pieces (e.g. file, image, movie)

On each node, process input splits

Merge pieces and create new data collections (a mapper sketch follows)
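A minimal mapper sketch for this migration pattern, written against Hadoop's pre-0.20 org.apache.hadoop.mapred API (v0.18 is used in the experiments below). The record layout (object identifier, path of one atomic piece) and the ps2pdf invocation are illustrative assumptions, not the Planets implementation; a reduce step would merge the emitted locations into a new collection.

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MigrationMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

    // Each record references one atomic piece of a digital object;
    // the map step migrates it with a tool preinstalled on the node.
    public void map(Text objectId, Text inputPath,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String migrated = inputPath.toString() + ".pdf";
        ProcessBuilder pb =
                new ProcessBuilder("ps2pdf", inputPath.toString(), migrated);
        try {
            Process p = pb.start();
            if (p.waitFor() == 0) {
                // Emit the location of the migrated piece for the reduce step.
                output.collect(objectId, new Text(migrated));
            } else {
                reporter.setStatus("migration failed for " + objectId);
            }
        } catch (InterruptedException e) {
            throw new IOException("migration interrupted: " + e.getMessage());
        }
    }
}
```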

Page 16: A Service for Data-Intensive Computations

Experimental Setup

Amazon Elastic Compute Cloud (EC2)

1 – 150 cluster nodes

Custom image based on Fedora 8 i386

Amazon Simple Storage Service (S3): max. 1 TB I/O per experiment

All measurements based on a single core per virtual machine

Apache Hadoop (v0.18) MapReduce implementation

Preinstalled command line tools

Quantitative evaluation of VMs (AWS) based on execution time, number of tasks, number of nodes, physical size

Page 17: A Service for Data-Intensive Computations

Experimental Results 1 – Scaling Job Size

[Chart: execution time [min] (0 to 10) vs. number of tasks (1 to 1000) for EC2 and SLE runs at task sizes 0.07 MB, 7.5 MB, and 250 MB]

number of nodes = 5, tasks = 1000

Speedup x(1k) = t_seq / t_parallel: x(1k) = 3.5 (0.07 MB), x(1k) = 4.4 (7.5 MB), x(1k) = 3.6 (250 MB)
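Written out, the slide's speedup definition compares sequential and parallel execution time over the 1000-task job:

```latex
x(1k) = \frac{t_{\mathrm{seq}}}{t_{\mathrm{parallel}}}
```

So x(1k) = 4.4 means the five-node cluster processed the 1000 tasks of 7.5 MB 4.4 times faster than sequential execution.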

Page 18: A Service for Data-Intensive Computations

Experimental Results 2 – Scaling #nodes

[Chart: execution time [min] (0 to 40) vs. number of nodes (0 to 150) for 1000 tasks of 0.07 MB each, EC2 vs. SLE]

Measured points (n nodes, time t in minutes, speedup s, efficiency e):

n=1: t=36, s=0.72, e=72%
n=5: t=4.5, s=3.3, e=66%
n=10: t=4.5, s=5.5, e=55%
n=50: t=1.68, s=16, e=31%
n=100: t=1.03, s=26, e=26%
n=1 (local): t=26
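The efficiency figures are consistent with e = s/n (the relation is not stated on the slide, but the reported numbers satisfy it):

```latex
e(n{=}5) = \frac{3.3}{5} = 66\%, \qquad
e(n{=}10) = \frac{5.5}{10} = 55\%, \qquad
e(n{=}100) = \frac{26}{100} = 26\%
```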

Page 19: A Service for Data-Intensive Computations

Results and Observations

AWS + Hadoop + JSS

Robust and fault-tolerant, scalable up to large numbers of nodes

~32.5 MB/s download / ~13.8 MB/s upload (cloud-internal)

Even small clusters of virtual machines are reasonable:

S = 4.4 for p = 5, with n = 1000 tasks of size s = 7.5 MB

Ideally, the cluster size grows/shrinks on demand

Overheads:

SLE vs. cloud (p=1, n=1000): 30% due to the file system master

On average, less than 10% due to S3

Small overheads due to coordination, latencies, pre-processing

Page 20: A Service for Data-Intensive Computations

Conclusion

Challenges in digital preservation are diversity and scale

Focus: scientific data, arts and humanities

Repositories and preservation systems need to be embedded within large-scale distributed environments: grids, clouds, eScience environments

Cloud computing provides a powerful solution for on-demand access to IT infrastructure

Current integration issues: policies, legal aspects, SLAs

We have presented current developments in the context of the EU project Planets towards building an integrated research environment for the development and evaluation of preservation strategies.

Page 21: A Service for Data-Intensive Computations

Fin