A Concept of Generic Workspace for Big Data Processing in Humanities
2013-10-08 Jedrzej Rybicki, Benedikt von St. Vieth & Daniel Mallmann
DARIAH
Digital Research Infrastructure for the Arts and Humanities
DARIAH-DE, the German part of DARIAH
supports Digital Humanities by providing
digital methods and tools for research and education
a platform that enables the interconnection of various disciplines
a sustainable research infrastructure
Jülich Supercomputing Center
Involved in the process of building an infrastructure which is generic, easy to
use, and provides state-of-the-art processing and storage services.
2013-10-08 A Concept of Generic Workspace for Big Data Processing in Humanities Slide 2
DARIAH-DE Storage Service
For bit-preservation purposes DARIAH-DE offers a Storage Service.
A researcher can use this service and
upload and download data objects using any HTTP client
expect that everything is stored in a safe manner (achieved using
replication across resources/computing centers)
The Storage Service is
providing an HTTP-based interface to storage resources
using a database to store basic metadata
relying on iRODS as its storage backend
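Since any HTTP client can talk to the Storage Service, the interaction can be sketched in a few lines of Python. The endpoint URL and object name below are hypothetical placeholders, and authentication is omitted; this only illustrates the upload/download pattern, not the real DARIAH-DE API.

```python
# Sketch of the Storage Service's HTTP interface: a data object is
# ingested with PUT and retrieved with GET. The endpoint URL and the
# object name are made-up placeholders; auth is left out.
import urllib.request

STORAGE_URL = "https://storage.example.org/dariah"  # hypothetical endpoint


def make_upload_request(object_name: str, payload: bytes) -> urllib.request.Request:
    """Build the HTTP request that ingests a data object (PUT <url>/<name>)."""
    return urllib.request.Request(
        url=f"{STORAGE_URL}/{object_name}",
        data=payload,
        method="PUT",
    )


def make_download_request(object_name: str) -> urllib.request.Request:
    """Build the HTTP request that retrieves a data object (GET <url>/<name>)."""
    return urllib.request.Request(url=f"{STORAGE_URL}/{object_name}", method="GET")


req = make_upload_request("corpus/novel.txt", b"Call me Ishmael.")
print(req.get_method(), req.full_url)
```

The requests are only constructed here, not sent; sending them with urllib.request.urlopen would require a reachable service.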
Why iRODS?
The integrated Rule-Oriented Data System was chosen because it provides
1 the rule engine, which allows modifying the behavior of the system
  actions, like acPostProcForPut, to react on system events
  rules, written in a native language, providing loops, if-statements, ...
  microservices, the smallest pieces of work
    many microservices are already available, used and chained together in rules
    written in C, so advanced users can extend iRODS functionality
2 storage drivers, abstracting various storage technologies
  drivers for the file system and some other storage providers are built into iRODS
  by implementing a common set of interactions (create, move, delete, ...) one can access any type of storage system
iRODS Example
acPostProcForPut {
  ON($objPath like "*/sayhello.do") {
    sampleRule("Hello User!", *status);
  }
}
# *text = input, *status = output
sampleRule(*text, *status) {
  msiWriteRodsLog("*text", *status);
}
DARIAH-DE Storage Service
Sample Repository ...
... Processing Result
Motivation
A researcher wants to extract information from the stored data objects
she can download the data and process them locally
  wastes her time, resources, and network bandwidth
  lack of processing power
iRODS provides microservices which can be used for processing
  requires C expertise
  and reconfiguration/recompilation of the server
Goal: Active Storage
Provide long-term storage with processing functionality in one place,
without increasing the complexity of the existing Storage Service.
Concept
The following decisions were made to extend the service
use the existing namespace to integrate a processing engine
utilize filesystem instructions (create, read, delete) to interface this engine
abstract the details of the underlying services
Generic Workspace
The researcher just has to interact with the namespace; everything is
provided in one place.
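A toy sketch of this interface idea: ordinary namespace operations double as the processing interface, so creating a file under a special proc/ collection is interpreted as a job submission, while any other create is a plain storage operation. The paths and the dispatch function below are illustrative, not the prototype's actual code.

```python
# Sketch of the "generic workspace" concept: the same filesystem
# instruction (create) either stores an object or submits a job,
# depending on where in the namespace it lands. All names are made up.
from pathlib import PurePosixPath


def classify_operation(path: str) -> str:
    """Decide what a 'create' on this namespace path means."""
    parts = PurePosixPath(path).parts
    # a file created inside a proc/ collection is treated as a job request
    if len(parts) >= 2 and parts[-2] == "proc":
        return f"submit job '{parts[-1]}'"
    return "store object"


print(classify_operation("/dariah/corpus/novel.txt"))       # store object
print(classify_operation("/dariah/corpus/proc/wordCount"))  # submit job 'wordCount'
```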
Big Data Processing
For the prototype we selected a processing engine that meets a few
requirements
addressable through iRODS
parallel processing of large amounts of data
We decided to use Hadoop for the prototype because it
implements the Map Reduce programming paradigm
is widely used in industry products for Big Data analysis
scales: if the prototype gets widely used, we can grow the Hadoop cluster
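The Map Reduce paradigm itself can be shown in miniature, without Hadoop: a map phase emits (key, value) pairs, a shuffle phase groups values by key, and a reduce phase aggregates each group. Word counting, the use case of this prototype, is the canonical example; this plain-Python sketch only illustrates the phases, not Hadoop's distributed execution.

```python
# Map Reduce in miniature: map emits (word, 1) pairs, shuffle groups
# them by word, reduce sums each group.
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (key, value) pair per word."""
    for word in line.split():
        yield (word.lower(), 1)


def shuffle(pairs):
    """Shuffle: group all values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate each group of values into one result."""
    return {word: sum(counts) for word, counts in groups.items()}


lines = ["the quick fox", "the lazy dog", "the fox"]
pairs = (pair for line in lines for pair in map_phase(line))
result = reduce_phase(shuffle(pairs))
print(result["the"], result["fox"])  # 3 2
```

In Hadoop, map and reduce run in parallel on many nodes, and the shuffle happens over the network between them.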
Apache Hadoop
Some information about the processing engine we have chosen
Open Source Framework
implements Map Reduce
based on Java
provides HDFS, a parallel filesystem that
divides files into chunks and distributes them over cluster nodes
implements replication
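What HDFS does with a file can be sketched as: split the data into fixed-size blocks and place each block, with replicas, on different cluster nodes. The tiny block size and the round-robin placement below are simplified stand-ins; real HDFS uses much larger blocks and a rack-aware placement policy.

```python
# Simplified sketch of HDFS block placement: chunk a byte string and
# assign each chunk's replicas to distinct nodes round-robin.
# Block size and placement policy are illustrative, not HDFS defaults.

def place_blocks(data: bytes, nodes: list, block_size: int = 4, replicas: int = 2):
    """Return a list of (block, [nodes holding a replica of that block])."""
    placement = []
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for idx, block in enumerate(blocks):
        # each replica of a block goes to a different node
        holders = [nodes[(idx + r) % len(nodes)] for r in range(replicas)]
        placement.append((block, holders))
    return placement


layout = place_blocks(b"humanities", ["node1", "node2", "node3"])
for block, holders in layout:
    print(block, holders)
```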
How This Works Together
iRODS
rule-engine, triggering MapReduce jobs after file ingestion
storage-driver, moving incoming files to HDFS
Hadoop
execution of Map Reduce jobs
storing files for processing on HDFS
Architecture
Technical Aspects
HDFS in iRODS:
part of a compound resource
files ingested into iRODS are uploaded to HDFS
currently using univMSSInterface.sh
“Job” management:
acPostProcForPut reacts to the ingestion of */proc/*-like files
a delayed rule that submits the Pig script and makes the results available is
started with msiExecCmd
Script management:
one common iRODS collection with scripts
common parameter handling (at least input and output must be defined)
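The job trigger relies on taking the ingested object's path apart: two successive msiSplitPath calls recover the job name, the proc collection, and its parent collection. In Python terms the path handling amounts to the sketch below; the example path is illustrative, not a real DARIAH-DE namespace entry.

```python
# Python analog of the two msiSplitPath calls used in the trigger rule:
# from an ingested path like <parent>/proc/<jobname>, recover the job
# name and the parent collection. The path itself is made up.
import posixpath


def split_path(path: str):
    """Analog of iRODS msiSplitPath: (parent collection, last component)."""
    return posixpath.dirname(path), posixpath.basename(path)


obj_path = "/dariah/myrepo/proc/wordCount"
proc_collection, jobname = split_path(obj_path)  # "/dariah/myrepo/proc", "wordCount"
parent, _ = split_path(proc_collection)          # "/dariah/myrepo"
print(jobname, parent)  # wordCount /dariah/myrepo
```

The parent collection and job name then become arguments to the script that submits the actual Pig job.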
Apache Pig
Apache Pig is a platform that creates Hadoop jobs from user-defined
SQL-like queries.
data = LOAD 'path/*' USING TextLoader();
token = FOREACH data GENERATE FLATTEN(TOKENIZE($0)) AS word;
words = FILTER token BY word MATCHES '\\w+';
gr = GROUP words BY word;
c = FOREACH gr GENERATE COUNT(words) AS cnt, group;
res = ORDER c BY cnt DESC;
STORE res INTO 'path/output.dat';
Summary
Generic Workspace for Big Data Processing
a working prototype has been implemented
follows the idea of an active storage with processing functionalities
instead of just storing data
uses a declarative approach; the user just has to define the expected
results
provides a Workspace that users, but also applications and other
services, can interact with
power users can extend the service by uploading Pig scripts
This prototype is extensible
other processing frameworks can be integrated
Another iRODS Example
acPostProcForPut {
  ON($objPath like "*/proc/wordCount") {
    [...]
    msiSplitPath(*path, *proccollection, *jobname);
    msiSplitPath(*proccollection, *parent, *ignored);
    [...]
    *arg = "*parent *output *jobname *scriptCollection";
    msiExecCmd("runPigJob.sh", "*arg", "null", "null", "null", *OUT);
    [...]
  }
}
Pig job wordfreq results
...
1775 the
1040 of
 730 in
 677 and
 457 to
 343 was
 334 a
 331 und
 248 die
 223 he
...