
Parallel Frameworks & Big Data: Hadoop and Spark on BioHPC

Updated for 2015-11-18

[web] portal.biohpc.swmed.edu

[email] biohpc-help@utsouthwestern.edu

Overview

What is 'Big Data'?

Big data & parallel processing go hand-in-hand

Models of parallelization

Apache Hadoop

Apache Spark

"Big Data"

Datasets too big for conventional data processing applications

In business, often 'Predictive Analytics'

Big Data in Biomedical Science

4

Complex, long-running analyses on GB datasets ≠ Big Data

A small number of genomic datasets (100s of GBs) is no longer 'big data'

Easy to manage on individual modern servers with standard apps

Cross-'omics' integration is often a big data problem

Thousands of datasets of different types: genomics, proteomics, metabolomics, imaging…

Different formats – import, storage, and integration problems

Common questions fit the 'predictive analytics' model of Big Data in business

A Big Data Example – ProteomicsDB

5

Partnership between academia and industry (SAP)

Data size is 721TB (not huge), but it contains 43 million MS/MS spectra

Biggest challenge is presentation and interpretation – search, visualization, etc.

Web environment uses SAP HANA – 160 cores and 2TB RAM behind the scenes

Nature. 2014 May 29;509(7502):582-7. doi: 10.1038/nature13319. Mass-spectrometry-based draft of the human proteome. Wilhelm M, Schlegl J, Hahne H, Moghaddas Gholami A, Lieberenz M, Savitski MM, Ziegler E, Butzmann L, Gessulat S, Marx H, Mathieson T, Lemeer S, Schnatbaum K, Reimer U, Wenschuh H, Mollenhauer M, Slotta-Huspenina J, Boese JH, Bantscheff M, Gerstmair A, Faerber F, Kuster B.

Big Data… light

6

Looks like 'Big Data'…

TB-scale input data

40,000 core hours of processing on TACC

BUT – all achieved using a traditional single relational database, algorithms, etc.

No flexible search / query architecture

Mol Cell Proteomics. 2014 Jun;13(6):1573-84. doi: 10.1074/mcp.M113.035170. Epub 2014 Apr 2. Confetti: a multiprotease map of the HeLa proteome for comprehensive proteomics. Guo X, Trudgian DC, Lemoff A, Yadavalli S, Mirzaei H.

Consider a Big Data problem…

7

I have 5,000 sequenced exomes with associated clinical observations, and want to:

• Find all SNPs in each patient

• Aggregate statistics for SNPs across all patients

• Easily query for SNP associations with any of my clinical variables

• Build a predictive model for a specific disease

A lot of data, a lot of processing – need to do each step in parallel

Parallelization – Shared Memory

8

[Diagram: a single system with two CPUs of four cores each, each CPU with its own directly attached RAM]

1 system, multiple CPUs

Each CPU has a portion of RAM attached directly

CPUs can access their own or the other's RAM

Parallel processes or threads can access any of the data in RAM easily
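To make the shared-memory model concrete, here is a minimal sketch in Scala (an added illustration, not from the original slides). One array lives in a single address space, and Scala's parallel collections (.par, built in up to Scala 2.12) spread the work across threads that all read the same data directly, with no data movement:

object SharedMemoryDemo {
  def main(args: Array[String]): Unit = {
    // One array in a single address space, visible to every thread
    val data = Array.tabulate(1000000)(i => i.toDouble)
    // .par splits the map and sum across threads that share this array directly
    val sumOfSquares = data.par.map(x => x * x).sum
    println(s"Sum of squares: $sumOfSquares")
  }
}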

Parallelization – Distributed Shared Memory

9

[Diagram: RAM from many systems combined into a single logical address space]

RAM is split across many systems that use a special interconnect

The entire RAM across all machines is accessible anywhere in a single address space

Still quite simple to program, but costly and now uncommon

Parallelization – Message Passing

10

Many systems, multiple CPUs each

Can only directly access RAM inside a single system

If data from another system is needed, it must be passed as a message across the network (sketched below)

Must be carefully planned and optimized

Difficult

[Diagram: eight networked systems, each with four cores and its own RAM]
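For contrast with the shared-memory sketch above, here is a toy message-passing sketch in Scala (an added illustration; it is not MPI and not from the original slides). Each worker owns only its local slice of the data and must explicitly send its partial result as a message; a BlockingQueue stands in for the network:

import java.util.concurrent.LinkedBlockingQueue

object MessagePassingDemo {
  def main(args: Array[String]): Unit = {
    // "Network" channel back to rank 0 – workers cannot touch each other's data directly
    val inbox = new LinkedBlockingQueue[Double]()
    // Each of 4 "ranks" owns only its own local slice
    val slices = Array.tabulate(4)(r => Array.tabulate(250000)(i => (r * 250000 + i).toDouble))

    val workers = slices.map { local =>
      new Thread(new Runnable {
        // Compute on local data only, then send the partial sum as a message
        def run(): Unit = inbox.put(local.map(x => x * x).sum)
      })
    }
    workers.foreach(_.start())

    // "Rank 0" receives one message per worker and combines the partial results
    val total = (1 to workers.length).map(_ => inbox.take()).sum
    workers.foreach(_.join())
    println(s"Sum of squares: $total")
  }
}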

Map Reduce

11

Impose a rigid structure on tasks: always map, then reduce

http://dme.rwth-aachen.de/de/research/projects/mapreduce
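To make the map/reduce structure concrete, here is a tiny word count in plain Scala on a local collection (an added illustration, no Hadoop involved): the map step emits a (word, 1) pair for every word, the pairs are grouped by key, and the reduce step sums the 1s in each group:

val lines = Seq("to be or not to be", "that is the question")
// Map: emit (word, 1) for every word in every line
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))
// Group by key (the shuffle), then Reduce: sum the counts for each word
val counts = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }
println(counts)   // e.g. Map(be -> 2, to -> 2, that -> 1, ...)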

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use a special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

https://en.wikipedia.org/wiki/Apache_Hadoop

Map – Count words in a line

13

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        // Emit (word, 1) for every token in the line
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

WordCount.java

Reduce – Sum the occurrences of each word from all lines

14

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Sum the counts emitted for this word by all mappers
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

WordCount.java

Driver – Run the map-reduce task

15

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}

WordCount.java

Running a Hadoop Job on BioHPC – sbatch script

16

#!/bin/bash
# Run on the super partition
#SBATCH -p super
# Use 64 tasks total
#SBATCH -n 64
# Across 2 nodes
#SBATCH -N 2
# With a 1h time limit
#SBATCH -t 1:00:00

module add myhadoop/0.30-spark

export HADOOP_CONF_DIR=$PWD/hadoop-conf.$SLURM_JOBID

myhadoop-configure.sh -s /tmp/$USER/$SLURM_JOBID -i "s/Nucleus[0]/10.10.10./"

$HADOOP_HOME/bin/start-all.sh
$HADOOP_HOME/bin/hadoop dfs -mkdir data
$HADOOP_HOME/bin/hadoop dfs -put pg2701.txt data
$HADOOP_HOME/bin/hadoop dfs -ls data
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-examples-*.jar wordcount data wordcount-output

$HADOOP_HOME/bin/hadoop dfs -ls wordcount-output
$HADOOP_HOME/bin/hadoop dfs -get wordcount-output

$HADOOP_HOME/bin/stop-all.sh

myhadoop-cleanup.sh

slurm_hadoop.sbatch

Hadoop Limitations on BioHPC

17

Inefficiency: (Small) wait for everything to start up. Workers are solely dedicated to you, and sit idle during portions of the job that are not highly parallelized.

HDFS: Uses /tmp on each compute node. Slow HDD – but lots of RAM for caching. Not persistent – deleted after the job ends.

Old Hadoop: Running Hadoop 1.2.1. Update to 2.x soon.

Looking for interested users to try out persistent HDFS

General Hadoop Limitations

18

Model: Rigid map -> reduce framework, hard to model some problems. Iterative algorithms can be difficult (a lot of scientific analysis).

Language: Java is the only 1st-class language. Wrappers / frameworks for other languages are available, but generally slower.

HDFS: Always writes results to disk after map/reduce. Architecture not good for small files / random reading.

Many things are alleviated by additional Hadoop projects – Hive, Pig, HBase, etc.

SPARK

19

• In-memory computing model
• Load/save data using HDFS or a standard file system
• Scala, Java, Python 1st class language support
• Interactive shells for exploratory analysis
• Libraries for database work, machine learning & linear algebra, etc.
• Can leverage Hadoop features (HDFS, HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop
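For comparison with the Java WordCount above, the same job written against Spark's RDD API is only a few lines. This is a sketch intended for spark-shell (where sc is already defined); the HDFS paths are hypothetical:

// Word count as a Spark RDD pipeline (Scala, run inside spark-shell)
val counts = sc.textFile("hdfs:///data/pg2701.txt")
  .flatMap(_.split("\\s+"))      // split lines into words
  .map(word => (word, 1))        // emit (word, 1) pairs
  .reduceByKey(_ + _)            // sum the counts per word, in memory
counts.saveAsTextFile("hdfs:///data/wordcount-output")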

Singular Value Decomposition

20

Challenge: Visualize patterns in a huge assay x gene dataset

Solution: Use SVD to compute eigengenes; visualize the data in a few dimensions that capture the majority of the interesting patterns

Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha. Singular value decomposition and principal component analysis. In: A Practical Approach to Microarray Data Analysis. D.P. Berrar, W. Dubitzky, M. Granzow, eds. pp. 91-109, Kluwer: Norwell, MA (2003). LANL LA-UR-02-4001

SVD on the cluster with Spark

21

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val input = sc.textFile("file:///home2/dtrudgian/Demos/hadoop/matrix.txt")

val rowVectors = input.map(
  _.split("\t").map(_.toDouble)
).map( v => Vectors.dense(v) ).cache()

val mat = new RowMatrix(rowVectors)

val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true)

Simple scripting language – this is Scala; you can also use Python

spark_svd.scala
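As a follow-up to the sketch above (added here, not part of the original script), the SingularValueDecomposition returned by computeSVD exposes the three factors, which is where the low-dimensional view of the data comes from:

// Inspect the factors, still inside spark-shell
val s = svd.s   // Vector of the top 20 singular values
val U = svd.U   // RowMatrix of left singular vectors (one row per input row)
val V = svd.V   // local Matrix of right singular vectors
println(s)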

Running Spark jobs on BioHPC

22

#!/bin/bash
# Run on the super partition
#SBATCH -p super
# Use 128 tasks total
#SBATCH -n 128
# Across 4 nodes
#SBATCH -N 4
# With a 1h time limit
#SBATCH -t 1:00:00

module add myhadoop/0.30-spark
export HADOOP_CONF_DIR=$PWD/hadoop-conf.$SLURM_JOBID

myhadoop-configure.sh -s /tmp/$USER/$SLURM_JOBID -i "s/Nucleus[0]/10.10.10./"

$HADOOP_HOME/bin/start-all.sh

source $HADOOP_CONF_DIR/spark/spark-env.sh
myspark start

spark-shell -i large_svd.scala

myspark stop

$HADOOP_HOME/bin/stop-all.sh

myhadoop-cleanup.sh

slurm_spark.sbatch

Demo – Interactive Spark – Python & Scala

23

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark cluster:  sbatch slurm_spark_interactive.sh

Wait for spark to start, then load the settings:  source hadoop-conf.<jobid>/spark/spark-env.sh

Connect using an interactive Scala session:  spark-shell  (quick example below)

...or an interactive Python session:  pyspark

And shut down when done:  scancel <jobid>
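Once spark-shell is up, a quick sanity check that work is being distributed to the workers might look like this (an added illustration):

// Inside spark-shell: distribute a small range across the workers and reduce it
val rdd = sc.parallelize(1 to 1000)
println(rdd.count())        // 1000
println(rdd.reduce(_ + _))  // 500500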

What do you want to do?

24

Discussion

What big-data problems do you have in your research?

Are Hadoop and/or Spark interesting for your projects?

How can we help you use Hadoop/Spark for your work?


Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

19

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

20

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

21

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

22

binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

23

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

What do you want to do

24

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Page 8: Parallel Frameworks & Big Data - UT Southwestern · Wilhelm M1, Schlegl J2, Hahne H3, Moghaddas Gholami A3, Lieberenz M4, Savitski MM5, Ziegler E4, Butzmann L4, Gessulat S4, Marx

Parallelization ndash Shared Memory

8

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

1 system multiple CPUs

Each CPU has portion of RAM attached directly

CPUs can access their or otherrsquos RAM

Parallel processes or threads can access any of the data in RAM easily

Parallelization ndash Distributed Shared Memory

9

RAM RAM RAM RAMRAM

Logical Address Space

RAM is split across many systems that use a special interconnect

The entire RAM across all machines is accessible anywhere in a single address space

Still quite simple to program but costly and now uncommon

Parallelization ndash Message Passing

10

Many systems multiple CPUs each

Can only directly access RAM inside single system

If needed data from other system must pass as a message across network

Must be carefully planned and optimized

Difficult

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Running a Hadoop Job on BioHPC ndash sbatch script

16

binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

17

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 2x soon

Looking for interested users to try out persistent HDFS

General Hadoop Limitations

18

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

19

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

20

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

21

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

22

binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

23

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

What do you want to do

24

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Page 9: Parallel Frameworks & Big Data - UT Southwestern · Wilhelm M1, Schlegl J2, Hahne H3, Moghaddas Gholami A3, Lieberenz M4, Savitski MM5, Ziegler E4, Butzmann L4, Gessulat S4, Marx

Parallelization ndash Distributed Shared Memory

9

RAM RAM RAM RAMRAM

Logical Address Space

RAM is split across many systems that use a special interconnect

The entire RAM across all machines is accessible anywhere in a single address space

Still quite simple to program but costly and now uncommon

Parallelization ndash Message Passing

10

Many systems multiple CPUs each

Can only directly access RAM inside single system

If needed data from other system must pass as a message across network

Must be carefully planned and optimized

Difficult

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Running a Hadoop Job on BioHPC ndash sbatch script

16

binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

17

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 2x soon

Looking for interested users to try out persistent HDFS

General Hadoop Limitations

18

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

19

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

20

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

21

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

22

binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

23

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

What do you want to do

24

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Page 10: Parallel Frameworks & Big Data - UT Southwestern · Wilhelm M1, Schlegl J2, Hahne H3, Moghaddas Gholami A3, Lieberenz M4, Savitski MM5, Ziegler E4, Butzmann L4, Gessulat S4, Marx

Parallelization ndash Message Passing

10

Many systems multiple CPUs each

Can only directly access RAM inside single system

If needed data from other system must pass as a message across network

Must be carefully planned and optimized

Difficult

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

CORE

CORE

CORE

CORE

RAM

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Running a Hadoop Job on BioHPC ndash sbatch script

16

binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

17

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 2x soon

Looking for interested users to try out persistent HDFS

General Hadoop Limitations

18

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

19

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

20

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

21

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

22

binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

23

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

What do you want to do

24

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Page 11: Parallel Frameworks & Big Data - UT Southwestern · Wilhelm M1, Schlegl J2, Hahne H3, Moghaddas Gholami A3, Lieberenz M4, Savitski MM5, Ziegler E4, Butzmann L4, Gessulat S4, Marx

Map Reduce

11

Impose a rigid structure on tasks Always map then reduce

httpdmerwth-aachendederesearchprojectsmapreduce

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Running a Hadoop Job on BioHPC ndash sbatch script

16

binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

17

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 2x soon

Looking for interested users to try out persistent HDFS

General Hadoop Limitations

18

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

19

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

20

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

21

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

22

binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

23

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

What do you want to do

24

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Page 12: Parallel Frameworks & Big Data - UT Southwestern · Wilhelm M1, Schlegl J2, Hahne H3, Moghaddas Gholami A3, Lieberenz M4, Savitski MM5, Ziegler E4, Butzmann L4, Gessulat S4, Marx

Distributed Computing - Hadoop

12

A different way of thinking

Specify problem as mapping and reduction steps

Run on subsets of data on different nodes

Use special distributed filesystem to communicate data between steps

The framework takes care of the parallelism

httpsenwikipediaorgwikiApache_Hadoop

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Running a Hadoop Job on BioHPC ndash sbatch script

16

binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

17

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 2x soon

Looking for interested users to try out persistent HDFS

General Hadoop Limitations

18

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

19

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

20

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

21

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

22

binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

23

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

What do you want to do

24

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Page 13: Parallel Frameworks & Big Data - UT Southwestern · Wilhelm M1, Schlegl J2, Hahne H3, Moghaddas Gholami A3, Lieberenz M4, Savitski MM5, Ziegler E4, Butzmann L4, Gessulat S4, Marx

Map ndash Count words in a line

13

public static class Map extends MapperltLongWritable Text Text IntWritablegt private final static IntWritable one = new IntWritable(1)private Text word = new Text()

public void map(LongWritable key Text value Context context) throws IOExceptionInterruptedException

String line = valuetoString()StringTokenizer tokenizer = new StringTokenizer(line)

while (tokenizerhasMoreTokens()) wordset(tokenizernextToken())contextwrite(word one)

WordCountjava

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Running a Hadoop Job on BioHPC ndash sbatch script

16

binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

17

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 2x soon

Looking for interested users to try out persistent HDFS

General Hadoop Limitations

18

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

19

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

20

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

21

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

22

binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

23

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

What do you want to do

24

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Page 14: Parallel Frameworks & Big Data - UT Southwestern · Wilhelm M1, Schlegl J2, Hahne H3, Moghaddas Gholami A3, Lieberenz M4, Savitski MM5, Ziegler E4, Butzmann L4, Gessulat S4, Marx

Reduce ndash Sum the occurrence of words on from all lines

14

public static class Reduce extends ReducerltText IntWritable Text IntWritablegt

public void reduce(Text key IterableltIntWritablegt values Context context) throws IOException InterruptedException

int sum = 0for (IntWritable val values)

sum += valget()contextwrite(key new IntWritable(sum))

WordCountjava

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Running a Hadoop Job on BioHPC ndash sbatch script

16

binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

17

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 2x soon

Looking for interested users to try out persistent HDFS

General Hadoop Limitations

18

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

19

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

20

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

21

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

22

binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

23

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

What do you want to do

24

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

Page 15: Parallel Frameworks & Big Data - UT Southwestern · Wilhelm M1, Schlegl J2, Hahne H3, Moghaddas Gholami A3, Lieberenz M4, Savitski MM5, Ziegler E4, Butzmann L4, Gessulat S4, Marx

Driver ndash Run the map-reduce task

15

public static void main(String[] args) throws Exception Configuration conf = new Configuration()

Job job = new Job(conf wordcount)

jobsetOutputKeyClass(Textclass)jobsetOutputValueClass(IntWritableclass)

jobsetMapperClass(Mapclass)jobsetReducerClass(Reduceclass)

jobsetInputFormatClass(TextInputFormatclass)jobsetOutputFormatClass(TextOutputFormatclass)

FileInputFormataddInputPath(job new Path(args[0]))FileOutputFormatsetOutputPath(job new Path(args[1]))

jobwaitForCompletion(true)

WordCountjava

Running a Hadoop Job on BioHPC ndash sbatch script

16

binbash Run on the super partitionSBATCH -p super Use 64 tasks totalSBATCH -n 64 Across 2 nodesSBATCH -N 2 With a 1h time limitSBATCH -t 10000

module add myhadoop030-spark

export HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh$HADOOP_HOMEbinhadoop dfs -mkdir data $HADOOP_HOMEbinhadoop dfs -put pg2701txt data $HADOOP_HOMEbinhadoop dfs -ls data $HADOOP_HOMEbinhadoop jar $HADOOP_HOMEhadoop-examples-jar wordcount data wordcount-output

$HADOOP_HOMEbinhadoop dfs -ls wordcount-output $HADOOP_HOMEbinhadoop dfs -get wordcount-output

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_hadoopsbatch

Hadoop Limitations on BioHPC

17

Inefficiency (Small) wait for everything to startupWorkers are solely dedicated to youSit idle during portions of job not highly parallelized

HDFS Uses tmp on each compute nodeSlow HDD ndash but lots of RAM for cachingNot persistent ndash deleted after job ends

Old Hadoop Running Hadoop 121Update to 2x soon

Looking for interested users to try out persistent HDFS

General Hadoop Limitations

18

Model Rigid map-gtreduce framework hard to model some problemsIterative algorithms can be difficult (lot of scientific analysis)

Language Java is only 1st class languageWrappers frameworks are other languages available but generally slower

HDFS Always write results to disk after mapreduceArchitecture not good for small filesrandom reading

Many things are alleviated by additional Hadoop projects ndash Hive Pig Hbase etc

SPARK

19

bull In-memory computing modelbull Loadsave data using HDFS or standard file systembull Scala Java Python 1st class language supportbull Interactive shells for exploratory analysisbull Libraries for database work machine learning amp linear algebra etcbull Can leverage hadoop features (HDFS HBASE) or run independently

Far easier and better suited to most scientific tasks than plain Hadoop

Singular Value Decomposition

20

Challenge Visualize patterns in a huge assay x gene dataset

Solution Use SVD to compute eigengenes visualize data in few dimensions thatcapture majority of the interesting patterns

Wall Michael E Andreas Rechtsteiner Luis M RochaSingular value decomposition and

principal component analysis in A Practical Approach to Microarray Data Analysis DP

Berrar W Dubitzky M Granzow eds pp 91-109 Kluwer Norwell MA (2003) LANL LA-

UR-02-4001

SVD on the cluster with Spark

21

import orgapachesparkrddRDDimport orgapachesparkmlliblinalg_ import orgapachesparkmlliblinalgVector import orgapachesparkmlliblinalgdistributedRowMatrix

val input = sctextFile(filehome2dtrudgianDemoshadoopmatrixtxt)

val rowVectors = inputmap(

_split(t)map(_toDouble)

)map( v =gt Vectorsdense(v) )cache()

val mat=new RowMatrix(rowVectors)

val svd SingularValueDecomposition[RowMatrix Matrix] = matcomputeSVD(20 computeU = true)

Simple scripting language - This is scala can also use python

spark_svdscala

Running Spark jobs on BioHPC

22

binbash Run on the super partitionSBATCH -p super Use 128 tasks totalSBATCH -n 128 Across 2 nodesSBATCH -N 4 With a 1h time limitSBATCH ndasht 10000

module add myhadoop030-sparkexport HADOOP_CONF_DIR=$PWDhadoop-conf$SLURM_JOBID

myhadoop-configuresh -s tmp$USER$SLURM_JOBID -i sNucleus[0]101010

$HADOOP_HOMEbinstart-allsh

source $HADOOP_CONF_DIRsparkspark-envsh myspark start

spark-shell -i large_svdscala

myspark stop

$HADOOP_HOMEbinstop-allsh

myhadoop-cleanupsh

slurm_sparksbatch

Demo ndash Interactive Spark ndash Python amp Scala

23

Must be on a cluster node to connect to spark workers (login node or GUI session)

Launch a spark clustersbatch slurm_spark_interactivesh

Wait for spark to start then load settingssource hadoop-confltjobidgtsparkspark-envsh

Connect using interactive scala sessionspark-shell scala

or interactive python sessionpyspark python

And shutdown when donescancel ltjobidgt

What do you want to do

24

Discussion

What big-data problems to you have in your research

Are Hadoop andor Spark interesting for your projects

How can we help you use HadoopSpark for your work

