A Concept of Generic Workspace for Big Data Processing in Humanities
2013-10-08 Jedrzej Rybicki, Benedikt von St. Vieth & Daniel Mallmann
DARIAH
Digital Research Infrastructure for the Arts and Humanities
DARIAH-DE, the German part of DARIAH
supports Digital Humanities by providing
digital methods and tools for research and education
a platform that enables the interconnection of various disciplines
a sustainable research infrastructure
Jülich Supercomputing Center
Involved in the process of building an infrastructure which is generic, easy to
use, and provides state-of-the-art processing and storage services.
2013-10-08 A Concept of Generic Workspace for Big Data Processing in Humanities Slide 2
DARIAH-DE Storage Service
For bit-preservation purposes DARIAH-DE offers a Storage Service.
A researcher can use this service and
upload and download data objects using any HTTP client
expect that everything is stored in a safe manner (achieved using
replication across resources/computing centers)
The Storage Service is
providing an HTTP-based interface to storage resources
using a database to store basic metadata
relying on iRODS as its storage backend
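Since any HTTP client can talk to the Storage Service, the interaction can be sketched in a few lines of Python. The endpoint URL and object name below are hypothetical placeholders, and authentication is omitted; this only illustrates the upload/download pattern, not the real DARIAH-DE API.

```python
# Sketch of the Storage Service's HTTP interface: a data object is
# ingested with PUT and retrieved with GET. The endpoint URL and the
# object name are made-up placeholders; auth is left out.
import urllib.request

STORAGE_URL = "https://storage.example.org/dariah"  # hypothetical endpoint


def make_upload_request(object_name: str, payload: bytes) -> urllib.request.Request:
    """Build the HTTP request that ingests a data object (PUT <url>/<name>)."""
    return urllib.request.Request(
        url=f"{STORAGE_URL}/{object_name}",
        data=payload,
        method="PUT",
    )


def make_download_request(object_name: str) -> urllib.request.Request:
    """Build the HTTP request that retrieves a data object (GET <url>/<name>)."""
    return urllib.request.Request(url=f"{STORAGE_URL}/{object_name}", method="GET")


req = make_upload_request("corpus/novel.txt", b"Call me Ishmael.")
print(req.get_method(), req.full_url)
```

The requests are only constructed here, not sent; sending them with urllib.request.urlopen would require a reachable service.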
Why iRODS?
The integrated Rule-Oriented Data System was chosen because it provides
1 the rule engine, which allows modifying the behavior of the system
  actions, like acPostProcForPut, to react on system events
  rules, written in a native language, providing loops, if-statements, ...
  microservices, the smallest pieces of work
    many microservices are already available, used and chained together in rules
    written in C, so advanced users can extend iRODS functionality
2 storage drivers, abstracting various storage technologies
  drivers for the file system and some other storage providers are built into iRODS
  by implementing a common set of interactions (create, move, delete, ...) one can access any type of storage system
iRODS Example
acPostProcForPut {
  ON($objPath like "*/sayhello.do") {
    sampleRule("Hello User!", *status);
  }
}
# *text = input, *status = output
sampleRule(*text, *status) {
  msiWriteRodsLog("*text", *status);
}
DARIAH-DE Storage Service
Sample Repository ...
... Processing Result
Motivation
A researcher wants to extract information from the stored data objects
she can download the data and process them locally
  wastes her time, resources, and network bandwidth
  lack of processing power
iRODS provides microservices which can be used for processing
  requires C expertise
  and reconfiguration/recompilation of the server
Goal: Active Storage
Provide long-term storage with processing functionality in one place,
without increasing the complexity of the existing Storage Service.
Concept
The following decisions were made to extend the service
use the existing namespace to integrate a processing engine
utilize filesystem instructions (create, read, delete) to interface this engine
abstract the details of the underlying services
Generic Workspace
The researcher just has to interact with the namespace; everything is
provided in one place.
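A toy sketch of this interface idea: ordinary namespace operations double as the processing interface, so creating a file under a special proc/ collection is interpreted as a job submission, while any other create is a plain storage operation. The paths and the dispatch function below are illustrative, not the prototype's actual code.

```python
# Sketch of the "generic workspace" concept: the same filesystem
# instruction (create) either stores an object or submits a job,
# depending on where in the namespace it lands. All names are made up.
from pathlib import PurePosixPath


def classify_operation(path: str) -> str:
    """Decide what a 'create' on this namespace path means."""
    parts = PurePosixPath(path).parts
    # a file created inside a proc/ collection is treated as a job request
    if len(parts) >= 2 and parts[-2] == "proc":
        return f"submit job '{parts[-1]}'"
    return "store object"


print(classify_operation("/dariah/corpus/novel.txt"))       # store object
print(classify_operation("/dariah/corpus/proc/wordCount"))  # submit job 'wordCount'
```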
Big Data Processing
For the prototype we selected a processing engine that meets a few
requirements
addressable through iRODS
parallel processing of large amounts of data
We decided to use Hadoop for the prototype because it
implements the Map Reduce programming paradigm
is widely used in industry products for Big Data analysis
scales: if the prototype gets widely used, we can grow the Hadoop cluster
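The Map Reduce paradigm itself can be shown in miniature, without Hadoop: a map phase emits (key, value) pairs, a shuffle phase groups values by key, and a reduce phase aggregates each group. Word counting, the use case of this prototype, is the canonical example; this plain-Python sketch only illustrates the phases, not Hadoop's distributed execution.

```python
# Map Reduce in miniature: map emits (word, 1) pairs, shuffle groups
# them by word, reduce sums each group.
from collections import defaultdict


def map_phase(line: str):
    """Map: emit a (key, value) pair per word."""
    for word in line.split():
        yield (word.lower(), 1)


def shuffle(pairs):
    """Shuffle: group all values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate each group of values into one result."""
    return {word: sum(counts) for word, counts in groups.items()}


lines = ["the quick fox", "the lazy dog", "the fox"]
pairs = (pair for line in lines for pair in map_phase(line))
result = reduce_phase(shuffle(pairs))
print(result["the"], result["fox"])  # 3 2
```

In Hadoop, map and reduce run in parallel on many nodes, and the shuffle happens over the network between them.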
Apache Hadoop
Some information about the processing engine we have chosen
Open Source Framework
implements Map Reduce
based on Java
provides HDFS, a parallel filesystem that
divides files into chunks and distributes them over cluster nodes
implements replication
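What HDFS does with a file can be sketched as: split the data into fixed-size blocks and place each block, with replicas, on different cluster nodes. The tiny block size and the round-robin placement below are simplified stand-ins; real HDFS uses much larger blocks and a rack-aware placement policy.

```python
# Simplified sketch of HDFS block placement: chunk a byte string and
# assign each chunk's replicas to distinct nodes round-robin.
# Block size and placement policy are illustrative, not HDFS defaults.

def place_blocks(data: bytes, nodes: list, block_size: int = 4, replicas: int = 2):
    """Return a list of (block, [nodes holding a replica of that block])."""
    placement = []
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    for idx, block in enumerate(blocks):
        # each replica of a block goes to a different node
        holders = [nodes[(idx + r) % len(nodes)] for r in range(replicas)]
        placement.append((block, holders))
    return placement


layout = place_blocks(b"humanities", ["node1", "node2", "node3"])
for block, holders in layout:
    print(block, holders)
```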
How This Works Together
iRODS
rule-engine, triggering MapReduce jobs after file ingestion
storage-driver, moving incoming files to HDFS
Hadoop
execution of Map Reduce jobs
storing files for processing on HDFS
Architecture
Technical Aspects
HDFS in iRODS:
part of a compound resource
files ingested into iRODS are uploaded to HDFS
currently using univMSSInterface.sh
“Job” management:
acPostProcForPut reacts to the ingestion of */proc/*-like files
a delayed rule that submits the Pig script and makes the results available is
started with msiExecCmd
Script management:
one common iRODS collection with scripts
common parameter handling (at least input and output must be defined)
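The job trigger relies on taking the ingested object's path apart: two successive msiSplitPath calls recover the job name, the proc collection, and its parent collection. In Python terms the path handling amounts to the sketch below; the example path is illustrative, not a real DARIAH-DE namespace entry.

```python
# Python analog of the two msiSplitPath calls used in the trigger rule:
# from an ingested path like <parent>/proc/<jobname>, recover the job
# name and the parent collection. The path itself is made up.
import posixpath


def split_path(path: str):
    """Analog of iRODS msiSplitPath: (parent collection, last component)."""
    return posixpath.dirname(path), posixpath.basename(path)


obj_path = "/dariah/myrepo/proc/wordCount"
proc_collection, jobname = split_path(obj_path)  # "/dariah/myrepo/proc", "wordCount"
parent, _ = split_path(proc_collection)          # "/dariah/myrepo"
print(jobname, parent)  # wordCount /dariah/myrepo
```

The parent collection and job name then become arguments to the script that submits the actual Pig job.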
Apache Pig
Apache Pig is a platform that creates Hadoop jobs from user-defined
SQL-like queries.
data = LOAD 'path/*' USING TextLoader();
token = FOREACH data GENERATE FLATTEN(TOKENIZE($0)) AS word;
words = FILTER token BY word MATCHES '\\w+';
gr = GROUP words BY word;
c = FOREACH gr GENERATE COUNT(words) AS cnt, group;
res = ORDER c BY cnt DESC;
STORE res INTO 'path/output.dat';
Summary
Generic Workspace for Big Data Processing
a working prototype has been implemented
follows the idea of an active storage with processing functionalities
instead of just storing data
uses a declarative approach; the user just has to define the expected
results
provides a Workspace that users, but also applications and other
services, can interact with
power users can extend the service by uploading Pig scripts
This prototype is extensible
other processing frameworks can be integrated
Another iRODS Example
acPostProcForPut {
  ON($objPath like "*/proc/wordCount") {
    [...]
    msiSplitPath(*path, *proccollection, *jobname);
    msiSplitPath(*proccollection, *parent, *ignored);
    [...]
    *arg = "*parent *output *jobname *scriptCollection";
    msiExecCmd("runPigJob.sh", "*arg", "null", "null", "null", *OUT);
    [...]
  }
}
Pig job wordfreq results
...
1775 the
1040 of
 730 in
 677 and
 457 to
 343 was
 334 a
 331 und
 248 die
 223 he
...