PaSh: Light-Touch Data-Parallel Shell Processing
![Page 1: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/1.jpg)
PaSh: Light-Touch Data-Parallel Shell Processing
Nikos Vasilakis* (MIT)
[email protected] github.com/andromeda/pash
Konstantinos Kallas* (University of Pennsylvania)
Konstantinos Mamouras (Rice University)
Achilles Benetopoulos (Unaffiliated)
Lazar Cvetković (University of Belgrade)
* equal contribution
![Page 2: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/2.jpg)
Shell Scripts are Everywhere
● Universal composition environment: commands (programs) can be written in C, C++, Rust, JS, Python, Ruby, Haskell...
● Default/scriptable system interface, even in the lightest containers (Kubernetes, Docker)
● Succinct data processing: download/extraction/preprocessing/querying
![Page 3: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/3.jpg)
A Classic Shell Script
Bentley: A word-counting challenge
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair.
10 was
10 the
10 of
10 it
2 times
Knuth: 100s of lines of literate WEB
McIlroy: Unix one-liner
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
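To try the one-liner on the passage above, it can be wrapped in a small script. This is a minimal sketch: the function name top_words and the LC_ALL=C setting are assumptions added here (the latter only makes tie-breaking between equal counts deterministic); the pipeline itself is the one on the slide, with ${1} quoted.

```shell
#!/bin/sh
# Pin the collation order so that words with equal counts sort byte-wise.
export LC_ALL=C

# top_words N: print the N most frequent words read from stdin.
# (Hypothetical wrapper name; the body is the pipeline from the slide.)
top_words() {
  tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed "${1}q"
}

passage='It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair.'

# Keep the result in a variable so it can be inspected or compared later.
result=$(printf '%s\n' "$passage" | top_words 5)
printf '%s\n' "$result"
```

The column width that uniq -c uses for counts differs between implementations, so any comparison of the output should normalize whitespace first.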
![Page 4: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/4.jpg)
A classic: Compute top-N words+counts
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
![Page 5: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/5.jpg)
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
Stage: tr -cs A-Za-z '\n'
Input: It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair.
Output (one word per line): It was the best of times it was the …
![Page 6: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/6.jpg)
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
Stage: tr A-Z a-z
Input (one word per line): It was the best of times it was the …
Output (one word per line): it was the best of times it was the …
![Page 7: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/7.jpg)
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
Stage: sort
Input (one word per line): it was the best of times it was the …
Output (one word per line): age age belief best darkness despair epoch epoch foolishness …
![Page 8: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/8.jpg)
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
Stage: uniq -c
Input (one word per line): age age belief best darkness despair epoch epoch foolishness …
Output (one count and word per line): 2 age, 1 belief, 1 best, 1 darkness, 1 despair, 2 epoch, 1 foolishness, 1 hope, 10 it, …
![Page 9: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/9.jpg)
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
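Stage: sort -rn
Input (one count and word per line): 2 age, 1 belief, 1 best, 1 darkness, 1 despair, 2 epoch, 1 foolishness, 1 hope, 1 incredulity, 10 it
Output (one count and word per line): 10 was, 10 the, 10 of, 10 it, 2 times, 2 season, 2 epoch, 2 age, 1 worst, 1 wisdom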
![Page 10: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/10.jpg)
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
Stage: sed ${1}q
Input (one count and word per line): 10 was, 10 the, 10 of, 10 it, 2 times, 2 season, 2 epoch, 2 age, 1 worst, 1 wisdom, …
Output (one count and word per line): 10 was, 10 the, 10 of, 10 it, 2 times
![Page 11: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/11.jpg)
tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
How to parallelize?
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of…
[Figure: the passage above tiled many times as a large input, alongside smaller fragments of it]
10 was 10 the 10 of 10 it 2 times
![Page 12: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/12.jpg)
Shell scripts are mostly sequential
Their parallelization requires considerable effort:
● Command-specific flags (e.g., sort -p, make -jN)
● Mostly-manual, restricted parallelization tools (e.g., GNU parallel)
● Full rewrites in parallel frameworks (e.g., MapReduce)
![Page 13: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/13.jpg)
Big-Data Version of McIlroy’s Pipeline
150-line Hadoop Program
import java.io.*;
import java.util.*;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class top_10_Movies_Mapper extends Mapper<Object, Text, Text, LongWritable> {
    private TreeMap<Long, String> tmap;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        tmap = new TreeMap<Long, String>();
    }

    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // no_of_views (tab separated)
        // we split the input data
        String[] tokens = value.toString().split("\t");
        String movie_name = tokens[0];
        long no_of_views = Long.parseLong(tokens[1]);
        tmap.put(no_of_views, movie_name);
        if (tmap.size() > 10) {
            tmap.remove(tmap.firstKey());
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<Long, String> entry : tmap.entrySet()) {
            long count = entry.getKey();
            String name = entry.getValue();
            context.write(new Text(name), new LongWritable(count));
        }
    }
}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Error: please provide two paths");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "top 10");
        job.setJarByClass(Driver.class);
        job.setMapperClass(top_10_Movies_Mapper.class);
        job.setReducerClass(top_10_Movies_Reducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class top_10_Movies_Reducer extends Reducer<Text, LongWritable, LongWritable, Text> {
    private TreeMap<Long, String> tmap2;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        tmap2 = new TreeMap<Long, String>();
    }

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        String name = key.toString();
        long count = 0;
        for (LongWritable val : values) {
            count = val.get();
        }
        tmap2.put(count, name);
        if (tmap2.size() > 10) {
            tmap2.remove(tmap2.firstKey());
        }
    }

    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        for (Map.Entry<Long, String> entry : tmap2.entrySet()) {
            long count = entry.getKey();
            String name = entry.getValue();
            context.write(new LongWritable(count), new Text(name));
        }
    }
}
![Page 14: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/14.jpg)
Mostly sequential by default: how to parallelize?
Parallelization requires considerable effort:
● Command-specific flags (e.g., sort -p, make -jN)
● Mostly-manual, restricted parallelization tools (e.g., GNU parallel)
● Full rewrites in parallel frameworks (e.g., MapReduce)
![Page 15: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/15.jpg)
Challenges of Automating Shell-Script Parallelization
(1) Numerous and opaque Unix commands
(2) Shell language enforced dependencies
(3) Runtime support for Unix parallelization
for directory in /project/gutenberg/*/; do
  ls $directory | grep 'txt' | wc -l > index.txt
done
echo 'Done';
cat f1 f2 | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed ${1}q
[Figure: the tr A-Z a-z stage drawn as several parallel copies between a split node and an aggregate node]
![Page 16: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/16.jpg)
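PaSh Overview
[Figure: compilation pipeline. seq.sh is parsed into an AST and compiled to a DFG; the DFG is optimized into an optimized DFG, emitted as an AST, and unparsed into par.sh. Three components are numbered: (1) Annotations, (2) DFG, (3) Runtime Library.]
seq.sh:
cat $f1 f2 | sort
DFG: f1, f2 → cat → sort
Optimized DFG: f1 → sort, f2 → sort, both → sort -m
par.sh:
mkfifo a b
sort f1 > a &
sort f2 > b &
sort -m a b &
wait;
rm -f a b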
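The seq.sh and par.sh pair from the overview can be tried side by side. The following is a sketch on throwaway inputs: the file contents, the temporary directory, and the par.out capture file are assumptions added here so the two results can be compared; the fifo structure follows the par.sh shown on the slide.

```shell
#!/bin/sh
export LC_ALL=C
dir=$(mktemp -d)
cd "$dir" || exit 1

# Small sample inputs (hypothetical contents).
printf '%s\n' pear apple fig > f1
printf '%s\n' kiwi banana cherry > f2

# seq.sh: one sort over the concatenated inputs.
seq_out=$(cat f1 f2 | sort)

# par.sh: sort each input concurrently, then merge the two sorted streams.
mkfifo a b
sort f1 > a &
sort f2 > b &
sort -m a b > par.out &   # output captured in a file so it can be compared
wait
rm -f a b
par_out=$(cat par.out)

printf '%s\n' "$par_out"
cd / && rm -rf "$dir"
```

Both variables should hold the same sorted list; sort -m only merges, so it relies on each fifo delivering already-sorted data.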
![Page 17: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/17.jpg)
PaSh Overview
[Figure: the PaSh overview diagram repeated (seq.sh → Parse → AST → DFG → Optimize → Optimized DFG → Emit → AST → Unparse → par.sh), with component (1) Annotations called out]
![Page 18: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/18.jpg)
1. Unix Parallelizability Study & Annotations
![Page 19: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/19.jpg)
[Figure: study of POSIX and GNU commands; other labels shown: Scripts, Ubuntu, PATH]
Parallelizability properties:
● 4 broad classes
● Flags and options
● Input consumption
Parallelizability DSL: (cmd, flg, [in]) → DFG node
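As a rough illustration of what an annotation lookup does, the sketch below maps a command and its flags to one of the four classes. It is not PaSh's annotation format: the function name classify, the class spellings, and the table entries for grep and cat are assumptions made for this example, while tr, wc, sort, sha1sum, and mv follow the examples on the slides. The cat -n case only illustrates that flags can change the class (numbering lines needs a running counter, so it cannot be stateless); the class chosen for it here is an assumption.

```shell
#!/bin/sh
# classify CMD [FLAGS]: print an illustrative parallelizability class.
classify() {
  cmd=$1
  flags=${2:-}
  case "$cmd" in
    cat)
      case "$flags" in
        *n*) echo parallelizable_pure ;;   # assumed: -n keeps a line counter
        *)   echo stateless ;;
      esac ;;
    tr|grep)  echo stateless ;;
    wc|sort)  echo parallelizable_pure ;;
    sha1sum)  echo non_parallelizable_pure ;;
    mv)       echo side_effectful ;;
    *)        echo side_effectful ;;       # conservative default for unknown commands
  esac
}

classify tr
classify cat -n
classify mv
```

Defaulting unknown commands to the most restrictive class means a missing entry can cost performance but not correctness.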
![Page 20: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/20.jpg)
4 command parallelizability classes
12.7% stateless
[Figure: input.txt (the word-counting passage) is divided between two tr instances, whose outputs are combined by cat]
![Page 21: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/21.jpg)
4 command parallelizability classes
12.7% stateless
8.7% parallelizable pure
[Figure: input.txt (the word-counting passage) is divided between two wc instances, each labeled "+state", whose outputs are combined by an agg node]
![Page 22: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/22.jpg)
4 command parallelizability classes
12.7% stateless
8.7% parallelizable pure
8.2% non-parallelizable pure
[Figure: input.txt (the word-counting passage) is divided between two sha1sum instances, each labeled "+state"; the combining node is marked with an x]
![Page 23: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/23.jpg)
4 command parallelizability classes
12.7% stateless
8.7% parallelizable pure
8.2% non-parallelizable pure
70.4% side-effectful (e.g., mv)
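The difference between the first two classes can be seen by hand. In this sketch (file names, the two-way split, and the sample text are assumptions), a stateless command such as tr is run on each chunk and the results are simply concatenated, whereas a parallelizable pure command such as wc -l needs an aggregation step, here a sum.

```shell
#!/bin/sh
export LC_ALL=C
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' 'It was the best of times,' 'it was the worst of times,' \
  'it was the age of wisdom,' 'it was the age of foolishness,' > input.txt

# Split the input into two line-aligned chunks.
head -n 2 input.txt > part1
tail -n +3 input.txt > part2

# Stateless: run tr on each chunk in parallel, then concatenate in order.
seq_tr=$(tr A-Z a-z < input.txt)
tr A-Z a-z < part1 > out1 &
tr A-Z a-z < part2 > out2 &
wait
par_tr=$(cat out1 out2)

# Parallelizable pure: run wc -l on each chunk, then aggregate by summing.
seq_wc=$(( $(wc -l < input.txt) ))
wc -l < part1 > n1 &
wc -l < part2 > n2 &
wait
par_wc=$(( $(cat n1) + $(cat n2) ))

echo "tr results match: $([ "$seq_tr" = "$par_tr" ] && echo yes || echo no)"
echo "line counts: sequential=$seq_wc parallel=$par_wc"
cd / && rm -rf "$dir"
```

A command like sha1sum has no such aggregator: the digest of the whole input cannot be computed from the digests of its chunks, which is why it falls in the non-parallelizable pure class.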
![Page 24: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/24.jpg)
PaSh Overview
[Figure: the PaSh overview diagram repeated (seq.sh → Parse → AST → DFG → Optimize → Optimized DFG → Emit → AST → Unparse → par.sh), with component (1) Annotations called out]
![Page 25: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/25.jpg)
PaSh Overview
[Figure: the PaSh overview diagram repeated (seq.sh → Parse → AST → DFG → Optimize → Optimized DFG → Emit → AST → Unparse → par.sh), with component (2) DFG called out]
![Page 26: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/26.jpg)
2. Dataflow Model & Transformations
![Page 27: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/27.jpg)
cat f1 f2 > out.txt; cat out.txt
DFG1: f1, f2 → cat → out
DFG2: out → cat
Scheduling constraint between DFG1 and DFG2
![Page 28: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/28.jpg)
cat f1 f2 | tr A-Z a-z | sort > out.txt; cat
DFG1: f1, f2 → cat → tr → sort → out
![Page 29: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/29.jpg)
cat f1 f2 | tr A-Z a-z | sort > out.txt; cat
DFG1: f1, f2 → cat → split → (tr, tr) → cat → sort → out
Transformation condition: tr is stateless
![Page 30: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/30.jpg)
cat f1 f2 | tr A-Z a-z | sort > out.txt; cat
DFG1: (f1 → tr, f2 → tr) → cat → sort → out
Transformation condition: cat followed by split
![Page 31: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/31.jpg)
cat f1 f2 | tr A-Z a-z | sort > out.txt; cat
DFG1: (f1 → tr, f2 → tr) → cat → split → (sort, sort) → merge → out
Transformation condition: sort is parallelizable pure
![Page 32: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/32.jpg)
cat f1 f2 | tr A-Z a-z | sort > out.txt; cat
DFG1: (f1 → tr → sort, f2 → tr → sort) → merge → out
Transformation condition: cat followed by split
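The end state of this sequence of transformations can be written out by hand and checked against the original pipeline. This is a sketch under assumptions: the sample inputs, the fifo names s1 and s2, and the use of sort -m as the merge node are choices made for this example, not output produced by PaSh.

```shell
#!/bin/sh
export LC_ALL=C
dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' Pear apple Fig > f1
printf '%s\n' kiwi Banana cherry > f2

# Original fragment: cat f1 f2 | tr A-Z a-z | sort > out.txt
cat f1 f2 | tr A-Z a-z | sort > out_seq.txt

# Transformed graph: (f1 -> tr -> sort, f2 -> tr -> sort) -> merge -> out
mkfifo s1 s2
tr A-Z a-z < f1 | sort > s1 &
tr A-Z a-z < f2 | sort > s2 &
sort -m s1 s2 > out_par.txt   # merge node, assumed here to be sort -m
wait
rm -f s1 s2

seq_out=$(cat out_seq.txt)
par_out=$(cat out_par.txt)
printf '%s\n' "$par_out"
cd / && rm -rf "$dir"
```

Each branch is an ordinary pipeline, so the two tr and sort instances run concurrently and only the final merge sees both streams.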
![Page 33: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/33.jpg)
1 + 3 Transformations
[Figure: transformation diagrams over DFG fragments; node labels: cat, grep (four parallel copies), cmd, split, relay; transformation labels: τ, τ1, τ2, τ3]
![Page 34: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/34.jpg)
PaSh Overview
[Figure: the PaSh overview diagram repeated (seq.sh → Parse → AST → DFG → Optimize → Optimized DFG → Emit → AST → Unparse → par.sh), with component (2) DFG called out]
![Page 35: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/35.jpg)
PaSh Overview
[Figure: the PaSh overview diagram repeated (seq.sh → Parse → AST → DFG → Optimize → Optimized DFG → Emit → AST → Unparse → par.sh), with component (3) Runtime Library called out]
![Page 36: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/36.jpg)
3. Runtime Support
![Page 37: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/37.jpg)
Runtime Support: Performance & Correctness
● Unix pipes are lazy, i.e., they provide inadequate buffering (and for a good reason)
● Dataflow graph termination is tricky
● Parallelizable-pure commands require careful aggregation
![Page 38: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/38.jpg)
Runtime Challenge: Unix's Lazy Semantics
mkfifo f1 f2
grep "foo" in1 > f1 &
grep "foo" in2 > f2 &
cat f1 f2
[Figure: two grep nodes feeding a cat node]
A non-solution: using files instead of fifos
![Page 43: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/43.jpg)
Runtime Challenge: Unix's Lazy Semantics
mkfifo f1 f2
grep "foo" in1 > f1 &
grep "foo" in2 > f2 &
cat f1 f2
[Figure: two grep nodes feeding a cat node; the two inputs of cat are numbered 1 and 2]
Execution proceeds in steps!
![Page 44: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/44.jpg)
A non-solution: Use intermediary files...
touch f1 f2
grep "foo" in1 > f1 &
grep "foo" in2 > f2 &
wait
cat f1 f2
[Figure: two grep nodes write the files f1 and f2, which cat then reads]
Among other problems, this "solution" prevents pipeline parallelism (more on that later)
![Page 45: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/45.jpg)
The PaSh Solution: Eager Buffers
mkfifo f1 f2 f3 f4
grep "foo" in1 > f1 &
grep "foo" in2 > f2 &
eager < f1 > f3 &
eager < f2 > f4 &
cat f3 f4
[Figure: two grep nodes, each followed by an eager node, feeding a cat node]
![Page 49: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/49.jpg)
The PaSh Solution: Eager Buffers
mkfifo f1 f2 f3 f4
grep "foo" in1 > f1 &
grep "foo" in2 > f2 &
eager < f1 > f3 &
eager < f2 > f4 &
cat f3 f4
[Figure: two grep nodes, each followed by an eager node, feeding a cat node]
/pash/runtime/eager
● Unix command, usable outside PaSh too
● Buffers input eagerly; can spill to disk
● Keeps fragment in DFG model
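The effect of an eager buffer can be approximated with standard tools. The sketch below is a deliberately simplified stand-in and not PaSh's eager: the helper name eager_sketch and its two-argument interface are hypothetical, and it drains its whole input into a temporary file before writing anything, whereas a real eager buffer would forward data as it arrives. What it does show is the decoupling: each grep can run to completion even though cat has not yet started reading that branch.

```shell
#!/bin/sh
# eager_sketch IN OUT: hypothetical stand-in, not PaSh's eager command.
# It reads IN to the end into a temporary file and only then opens OUT,
# so the producer on IN never waits for the consumer on OUT.
eager_sketch() {
  tmp=$(mktemp) || return 1
  cat "$1" > "$tmp"
  cat "$tmp" > "$2"
  rm -f "$tmp"
}

dir=$(mktemp -d)
cd "$dir" || exit 1
printf '%s\n' 'foo 1' 'bar' 'foo 2' > in1
printf '%s\n' 'baz' 'foo 3' > in2

mkfifo f1 f2 f3 f4
grep "foo" in1 > f1 &
grep "foo" in2 > f2 &
eager_sketch f1 f3 &
eager_sketch f2 f4 &
par_out=$(cat f3 f4)
wait
rm -f f1 f2 f3 f4

# Sequential reference for comparison.
seq_out=$(grep "foo" in1; grep "foo" in2)
printf '%s\n' "$par_out"
cd / && rm -rf "$dir"
```

The helper takes file names rather than using shell redirections because a redirection such as > f4 would block in the shell until cat opens f4, before the helper has had a chance to read any input.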
![Page 50: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/50.jpg)
Demo Time!
![Page 51: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/51.jpg)
Evaluation
![Page 52: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/52.jpg)
1. Expert / Classic Scripts
Speedups against bash baseline for pash --width=16:
5.93× vs. 8.83×
[Figure: speedups per script across configurations, including a no-runtime-support baseline; an annotation marks the word-counting script shown before]
![Page 53: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/53.jpg)
2. Pipelines in the wild
Configuration: Full PaSh --width=16
[Figure legend: Parallelizable, Non parallelizable]
+ PaSh awareness goes a long way! e.g. #26:
cat $IN6 | awk '{print $2, $0}' | sort -nr | cut -d ' ' -f 2 (1.01×)
cat $IN6 | sort -nr -k2 | cut -d ' ' -f 1 (8.1× !!1!1)
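The two variants of pipeline #26 can be compared on a small input to see that they are intended to compute the same thing. This is a sketch: the contents and format of $IN6 (a name and a number separated by one space) are assumptions, and the two forms may order lines differently when numeric keys tie, so the sample uses distinct keys.

```shell
#!/bin/sh
export LC_ALL=C
IN6=$(mktemp)
# Hypothetical input: "<name> <number>", one record per line, distinct numbers.
printf '%s\n' 'alice 30' 'bob 7' 'carol 112' 'dave 45' > "$IN6"

# Variant reported at 1.01x: awk copies the key to the front before sorting.
awk_out=$(cat "$IN6" | awk '{print $2, $0}' | sort -nr | cut -d ' ' -f 2)

# Variant reported at 8.1x: sort is told which field holds the key.
sort_out=$(cat "$IN6" | sort -nr -k2 | cut -d ' ' -f 1)

printf '%s\n' "$sort_out"
rm -f "$IN6"
```

Both print the names ordered by descending number; the second form avoids the awk stage entirely and leaves the work to sort.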
![Page 54: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/54.jpg)
3. Case Study no.1: NOAA Weather Analysis
Configuration: Full PaSh --width=16, 82GB (5y data)
Phases: fetch, preprocess, cleanup, filter (preprocessing); calculate
bash: 33m58s preprocessing, 10m4s calculate
pash -w 16: 16m39s preprocessing, 49s calculate
2.04× speedup for preprocessing
12.31× speedup for calculate
2.52× combined speedup for the full program
Hadoop only focuses on this part (calculate)
This part (preprocessing) is not the focus of traditional parallelization frameworks but parallelizing it has the biggest impact
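The reported speedups can be checked against the reported times. The sketch below assumes, based on the arithmetic, that 33m58s and 10m4s are the two bash phases and 16m39s and 49s the corresponding pash -w 16 phases; the helper name speedup is ours. The second ratio comes out slightly above the slide's 12.31×, presumably because the slide's times are rounded to whole seconds.

```shell
#!/bin/sh
# speedup BASE NEW: ratio of two durations in seconds, two decimal places.
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", a / b }'
}

bash_pre=2038    # 33m58s
bash_calc=604    # 10m4s
pash_pre=999     # 16m39s
pash_calc=49     # 49s

pre=$(speedup "$bash_pre" "$pash_pre")
calc=$(speedup "$bash_calc" "$pash_calc")
total=$(speedup $((bash_pre + bash_calc)) $((pash_pre + pash_calc)))

echo "first phase:  ${pre}x"
echo "second phase: ${calc}x"
echo "combined:     ${total}x"
```

The combined figure is dominated by the first phase, since it accounts for most of the total running time under both bash and PaSh.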
![Page 55: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/55.jpg)
Conclusion
![Page 56: PaSh: Light-Touch Data-Parallel Shell Processing](https://reader030.fdocuments.us/reader030/viewer/2022012613/61970414c801ba2c3b5bfa89/html5/thumbnails/56.jpg)
Conclusion
● Parallelize Unix shell scripts (POSIX -> POSIX)
● Annotations address extensibility issues
● Open source — 12+ contributors
● Lots of recent excitement — let's rehabilitate the shell!
[email protected] github.com/andromeda/pash