Reproducible bioinformatics pipelines with Docker and Anduril
![Page 1: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/1.jpg)
Reproducible Bioinformatics Pipelines with Docker & Anduril
Christian Frech, PhD
Bioinformatician at Children’s Cancer Research Institute, Vienna
CeMM Special Seminar
September 25th, 2015
![Page 2: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/2.jpg)
Why care about reproducible pipelines in bioinformatics?
For your (future) self
- Quickly re-run analysis with different parameters/tools
- Best documentation of how results have been produced

For others
- Allow others to easily reproduce your findings (“reproducibility crisis”)*
- Code re-use between projects and colleagues

*) http://theconversation.com/science-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998
![Page 3: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/3.jpg)
Obstacles to computational reproducibility
- Software/script not available (even upon request)
- Black box: Code (or even virtual machine) available, but no documentation on how to run it
- Dependency hell: Software and documentation available, but (too) difficult to get it running
- Code rot: Code breaks over time due to software updates
- 404 Not Found: unstable URLs, e.g. links to lab homepages
Go figure…
![Page 4: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/4.jpg)
Computational pipelines to the rescue
In bioinformatics, data analysis typically consists of a series of heterogeneous programs strung together via file-based inputs and outputs
- Example: FASTQ -> alignment (BWA) -> variant calling (GATK) -> variant annotation (SnpEff) -> custom R script

Simple automation via (bash/R/Python/Perl) scripting has its limitations
- No error checking
- No partial execution
- No parallelization
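The "no partial execution" limitation above can be made concrete with a small sketch. It is a minimal illustration, not how any particular framework works: a step is re-run only when its output is missing or older than its inputs (the timestamp rule that make-like tools use). The file names and the `copy_upper` stand-in tool are hypothetical.

```python
import os
import tempfile


def is_stale(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def run_step(target, sources, action, log):
    """Run action only when target is stale; record the decision in log."""
    if is_stale(target, sources):
        action(target, sources)
        log.append("ran " + os.path.basename(target))
    else:
        log.append("skipped " + os.path.basename(target))


def copy_upper(target, sources):
    """Hypothetical stand-in for a real tool: upper-case the first source."""
    with open(sources[0]) as src, open(target, "w") as dst:
        dst.write(src.read().upper())


def demo():
    """Run the same step twice; the second call finds the target up to date."""
    log = []
    with tempfile.TemporaryDirectory() as workdir:
        reads = os.path.join(workdir, "reads.txt")
        aligned = os.path.join(workdir, "aligned.txt")
        with open(reads, "w") as handle:
            handle.write("acgt\n")
        run_step(aligned, [reads], copy_upper, log)  # target missing: runs
        run_step(aligned, [reads], copy_upper, log)  # target up to date: skipped
    return log
```

A plain shell script would execute both calls unconditionally; a pipeline framework adds this kind of bookkeeping for every step.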
![Page 5: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/5.jpg)
No shortage of pipeline frameworks
- Script-based
  - GNU Make, Snakemake, Bpipe, Ruffus, Drake, Rake, Nextflow, …
- GUI-based
  - Galaxy, GenePattern, Chipster, Taverna, Pegasus, …
  - Various commercial solutions for more standardized workflows (e.g. RNA-seq)
  - Geared toward biologists without programming skills (“point-and-click”)

See also https://www.biostars.org/p/79, https://www.biostars.org/p/91301/
![Page 6: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/6.jpg)
Personal wish list for pipeline framework
- Script-based (maximum flexibility, minimum overhead)
- Powerful scripting language
- Cluster integration (preferably via slurm)
- Modular (allow code re-use b/w projects and colleagues)
- Component library for frequent tasks (e.g. join two CSV files)
- Reporting (HTML, PDF) to share results
- Free & open-source
- Bundle scripts/data with execution environment
![Page 7: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/7.jpg)
What’s wrong with good ol’ GNU make?
PRO
- Available on all Linux platforms
- Stood the test of time (the original make was developed in the 1970s)
- Rapid development (Bash scripting + target rules)
- Parallel execution of independent targets (-j parameter)

CON
- No cluster support
- Arcane syntax, cryptic pattern rules
- Half-baked multi-output rules
- No type checking (everything is a generic file)
- Difficult to modularize (code re-use)
- Rebuild not triggered by recipe change
- No reporting
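The point "Rebuild not triggered by recipe change" can be illustrated as follows. Make compares file timestamps only, so editing a command leaves the old output in place. A workflow engine can avoid this by storing a fingerprint of the command that produced each target. This is a minimal sketch of that idea, not a description of how Anduril or any other tool does it; the `state` dictionary and the command strings are hypothetical.

```python
import hashlib


def recipe_fingerprint(command):
    """Return a stable hash of the command line used to build a target."""
    return hashlib.sha256(command.encode("utf-8")).hexdigest()


def needs_rerun(target, command, state):
    """True if target was never built or was built with a different command.

    state maps a target name to the fingerprint of its last build command.
    """
    return state.get(target) != recipe_fingerprint(command)


def record_run(target, command, state):
    """Remember which command produced target."""
    state[target] = recipe_fingerprint(command)
```

In practice the state would be persisted next to the outputs and combined with a timestamp or content check on the inputs.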
![Page 9: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/9.jpg)
Anduril
- Developed since 2008 at Biomedicum Systems Biology Laboratory, Helsinki, Finland (http://research.med.helsinki.fi/gsb/hautaniemi/)
- Built for scientific data analysis with focus on bioinformatics
- Proprietary workflow scripting language “Anduril script”
  - Possibility to embed native code (Bash/R/Python/Perl)
  - Version 2 will switch to Scala
- Open source & free
- Significo (http://www.significo.fi/) is a commercial spin-off offering Anduril consulting services
- No widespread adoption (yet?)
![Page 10: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/10.jpg)
Anduril features
- Script-based (maximum flexibility, less overhead)
- Expressive scripting language
- Cluster integration (preferably via slurm)
- Modular to allow code re-use (b/w projects and colleagues)
- Ready-made component library for frequent analysis steps
- Reporting (HTML, PDF) to share results
- Free & open-source
- Bundle scripts/data with execution environment
X
![Page 11: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/11.jpg)
Example workflow: RNA-seq alignment with GSNAP
Anduril script:

    inputBamDir = INPUT(path="/data/bam", recursive=false)
    inputBamFiles = Folder2Array(folder1 = inputBamDir, filePattern = "C57C3ACXX_CV_([^_]+)_.*[.]bam$")

    alignedBams = record()
    for bam : std.iterArray(inputBamFiles) {
        gsnap = GSNAP(
            reads = INPUT(path=bam.file),
            options = "--npaths=1 --max-mismatches=1 --novelsplicing=0",
            @cpu = 10,
            @memory = 40000,
            @name = "gsnap_" + bam.key
        )
        alignedBams[bam.key] = gsnap.alignment
    }

Execute with (distributed execution on cluster):

    $ anduril run workflow.and --exec-mode slurm
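To see what the `filePattern` in the script above selects, the following Python sketch mimics the step. It assumes that the first capture group of the pattern becomes the key of each array entry, which the use of `bam.key` and `bam.file` suggests but the slides do not state. The function `folder_to_array` and the example file names are hypothetical.

```python
import re


def folder_to_array(file_names, file_pattern):
    """Map the first capture group of file_pattern to each matching file name."""
    regex = re.compile(file_pattern)
    keyed = {}
    for name in sorted(file_names):
        match = regex.search(name)
        if match:
            keyed[match.group(1)] = name
    return keyed


PATTERN = "C57C3ACXX_CV_([^_]+)_.*[.]bam$"

# Hypothetical directory listing; only the two BAM files match the pattern.
EXAMPLE_FILES = [
    "C57C3ACXX_CV_sampleA_lane1.bam",
    "C57C3ACXX_CV_sampleB_lane1.bam",
    "notes.txt",
]
```

Each key ("sampleA", "sampleB") would then name one GSNAP instance in the loop, e.g. "gsnap_sampleA".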
![Page 12: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/12.jpg)
Embedding native R code in Anduril script
Convert UCSC to Ensembl chromosome names in a CSV file containing column ‘chrom’:

    ensembl = REvaluate(
        table1 = ucsc,
        script = StringInput(content=
            '''
            table.out <- table1
            table.out$chrom <- gsub("^chr", "", table.out$chrom)
            '''
        )
    )

Also supports inlining of Bash, Python, Java, and Perl scripts
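For readers less familiar with R, the same transformation can be sketched in plain Python. This is only an illustration of the renaming rule on the slide (strip a leading "chr" from the ‘chrom’ column), not Anduril code; the function name and the example column "pos" are hypothetical.

```python
import csv
import io
import re


def ucsc_to_ensembl(csv_text):
    """Strip a leading 'chr' from the 'chrom' column of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Same rule as the R call gsub("^chr", "", ...): only a prefix is removed.
        row["chrom"] = re.sub(r"^chr", "", row["chrom"])
        writer.writerow(row)
    return out.getvalue()
```

Like the R snippet, this only removes the prefix; it does not handle names that differ beyond it.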
![Page 13: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/13.jpg)
Anduril features
- Script-based (maximum flexibility, less overhead)
- Expressive scripting language
- Cluster integration (preferably via slurm)
- Modular to allow code re-use (b/w projects and colleagues)
- Ready-made component library for frequent analysis steps
- Reporting (HTML, PDF) to share results
- Free & open-source
- Bundle scripts/data with execution environment
?
![Page 14: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/14.jpg)
Docker
- “Lightweight” virtualization technology for Unix-based systems
- Processes run in isolated namespaces (“containers”), but share same kernel
- Like VMs: containers portable between systems -> reproducibility!
- Unlike VMs: instant startup, no resource pre-allocation -> better hardware utilization

[Figure: VM vs. Container]
![Page 15: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/15.jpg)
How to bundle workflow with execution environment?
Solution 1: a single container holding Anduril, the workflow, and Components 1, 2, and 3
- Pro: Single container, easy to maintain
- Con: VM-like approach; huge, monolithic container, difficult to share (against Docker philosophy)

Solution 2: Anduril and the workflow, with each component in its own container (Container A: Component 1; Container B: Component 2; Container C: Component 3)
- Pro: Completely modularized, easy to re-use/share workflow components
- Con: “container hell”?
![Page 16: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/16.jpg)
Hybrid solution
“Docker inside docker”
- Master container: Anduril, the workflow, Component 2, and Component 3
  - Project- and user-specific components installed in master container
- Container A: Component 1
  - Shared components installed in common container (e.g. container “RNA-seq”)

Pro: Workflow completely containerized (= portable); only shared components in common containers
Con: Still (but greatly reduced) overhead for container maintenance
![Page 17: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/17.jpg)
Dockerized GSNAP in Anduril
    inputBamDir = INPUT(path="/data/bam", recursive=false)
    inputBamFiles = Folder2Array(folder1 = inputBamDir, filePattern = "C57C3ACXX_CV_([^_]+)_.*[.]bam$")

    alignedBams = record()
    for bam : std.iterArray(inputBamFiles) {
        gsnap = GSNAP(
            reads = INPUT(path=bam.file),
            options = "--npaths=1 --max-mismatches=1 --novelsplicing=0",
            docker = "cfrech/anduril-gsnap-2015-09-21",
            @cpu = 10,
            @memory = 40000,
            @name = "gsnap_" + bam.key
        )
        alignedBams[bam.key] = gsnap.alignment
    }
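The slides do not show how the `docker` parameter is turned into a container invocation. As a rough mental model only, a workflow engine could wrap the component's command in a `docker run` call that mounts the data directory. The sketch below builds such an argument list; the function `docker_run_argv` and the choice of mounts are assumptions, and only standard `docker run` options (`--rm`, `-v`) are used.

```python
def docker_run_argv(image, command, mounts=None):
    """Build an argument list that runs command inside a Docker image.

    mounts maps host paths to container paths (bind mounts via -v).
    """
    argv = ["docker", "run", "--rm"]
    for host_path, container_path in (mounts or {}).items():
        argv += ["-v", host_path + ":" + container_path]
    argv.append(image)
    argv += list(command)
    return argv


# Hypothetical invocation for the GSNAP component shown above.
EXAMPLE_ARGV = docker_run_argv(
    "cfrech/anduril-gsnap-2015-09-21",
    ["gsnap", "--npaths=1"],
    mounts={"/data/bam": "/data/bam"},
)
```

The resulting list could be passed to `subprocess.run` on a host with Docker installed; mounting the input directory at the same path keeps file paths in the workflow valid inside the container.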
![Page 18: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/18.jpg)
So, Anduril is great… but
- Proprietary scripting language
  - Biggest hurdle for widespread adoption IMO
  - Will likely improve with version 2 (which uses Scala)
- Documentation opaque for beginners
  - WANTED: Simple step-by-step guide to build your first Anduril workflow
- High upfront investment to get going (because of the above)
- In-lining Bash/R/Perl/Python should be simpler
  - Currently too much clutter when using “BashEvaluate” and the like
- Coding in Anduril sometimes “feels heavy” compared to other frameworks (e.g. GNU Make)
  - Will improve with fluency in workflow scripting language
![Page 19: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/19.jpg)
Anduril RNA-seq case study
![Page 20: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/20.jpg)
RNA-seq case study
Step 1: Configure Anduril workflow

    title = "My project long title"
    shortName = "My project short title"
    authors = "Christian Frech"

    // analyses to run
    runNetworkAnalysis = true
    runMutationAnalysis = true
    runGSEA = true

    // constants
    PROJECT_BASE = "/mnt/projects/myproject"
    gtf = INPUT(path=PROJECT_BASE+"/data/Homo_sapiens.GRCh37.75.etv6runx1.gtf.gz")
    referenceGenomeFasta = INPUT(path="/data/reference/human_g1k_v37.fasta")
    ...

+ description of samples, sample groups, and group comparisons in external CSV file
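The layout of the external CSV file is not shown on the slide. Purely as an illustration of the idea, the sketch below assumes a two-column table with hypothetical headers `sample` and `group` and collects the samples belonging to each group, which is the information a differential expression comparison needs.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample sheet; the real column names are not given in the slides.
SAMPLES_CSV = """sample,group
S1,control
S2,control
S3,treated
"""


def samples_by_group(csv_text):
    """Return a dict mapping each group name to its list of sample names."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["group"]].append(row["sample"])
    return dict(groups)
```

Keeping this description outside the workflow script means the same workflow can be re-run on a new project by swapping the CSV file.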
![Page 21: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/21.jpg)
RNA-seq case study
Step 2: Run Anduril workflow on cluster

    $ anduril run main.and --exec-mode slurm
![Page 22: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/22.jpg)
RNA-seq case study
Step 3: Go for lunch
![Page 23: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/23.jpg)
RNA-seq case study
Step 4: Study PDF report
![Page 24: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/24.jpg)
What follows are screenshots from this PDF report
![Page 25: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/25.jpg)
QC: Read counts
![Page 26: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/26.jpg)
QC: Gene body coverage
![Page 27: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/27.jpg)
QC: Distribution of expression values per sample
![Page 28: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/28.jpg)
QC: Sample PCA & heatmap
![Page 29: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/29.jpg)
Volcano plot for each comparison
![Page 30: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/30.jpg)
Table report of DEGs for each comparison
![Page 31: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/31.jpg)
Expression values of top diff. expressed genes per comparison
![Page 32: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/32.jpg)
GO term enrichment for each comparison
![Page 33: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/33.jpg)
Interaction network of DEGs for each comparison
![Page 34: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/34.jpg)
Chromosomal distribution of DEGs
![Page 35: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/35.jpg)
GSEA heat map summarizing all comparisons
- Rows = enriched gene sets
- Columns = comparisons
- Value = normalized enrichment score (NES)
- Red = enriched for up-regulated genes
- Blue = enriched for down-regulated genes
- \* = significant (FDR < 0.05)
- \*\* = highly significant (FDR < 0.01)
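The significance markers in the legend follow two FDR thresholds. As a small sketch of that rule (the function name is hypothetical and this is not the code that produced the report):

```python
def significance_marker(fdr):
    """Return the heat map annotation for a gene set's FDR value."""
    if fdr < 0.01:
        return "**"  # highly significant
    if fdr < 0.05:
        return "*"   # significant
    return ""        # not significant, no marker
```

The stricter threshold is tested first so that a highly significant gene set receives only the double marker.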
![Page 36: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/36.jpg)
Future developments
- Push new Anduril components to public repository (needs some refactoring, documentation, test cases)
- Help on Anduril2 manuscript
- Port custom Makefiles to Anduril (ongoing)
- Cloud deployment of dockerized workflow
  - Couple slurm to AWS EC2
  - Automatic spin-up of docker-enabled AMIs serving as computing nodes
![Page 37: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/37.jpg)
In the (not so) distant future …
$ docker pull cfrech/frech2015_et_al
$ docker run cfrech/frech2015_et_al --use-cloud --max-nodes 300 --out output
$ evince output/figure1.pdf
![Page 38: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/38.jpg)
Further reading
Discussion thread on Docker & Anduril: https://groups.google.com/forum/#!msg/anduril-dev/Et8-YG9O-Aw/24i4M1pDIfcJ
![Page 39: Reproducible bioinformatics pipelines with Docker and Anduril](https://reader035.fdocuments.us/reader035/viewer/2022062218/58f000881a28abd0478b46b1/html5/thumbnails/39.jpg)
Acknowledgement
- Marko Laakso (Significo)
- Sirku Kaarinen (Significo)
- Kristian Ovaska (Valuemotive)
- Pekka Lehti (Valuemotive)
- Ville Rantanen (University of Helsinki, Hautaniemi lab)
- Nuno Andrade (CCRI)
- Andreas Heitger (CCRI)