kronos Documentation - Read the Docs

kronos Documentation, Release 2.2.0
M. Jafar Taghiyar
July 15, 2016



  • Contents

    1 Table of Contents
        1.1 Getting started
            1.1.1 Download
            1.1.2 Dependencies
            1.1.3 Install
        1.2 Kronos package
            1.2.1 Kronos features
            1.2.2 Kronos commands
                make_component
                make_config
                update_config
                init
                run
        1.3 Configuration file
            1.3.1 Pipeline_info section
            1.3.2 General section
            1.3.3 Shared section
                Connections
                    IO connection
            1.3.4 Samples section
            1.3.5 Task section
                Reserved subsection
                Run subsection
                    use_cluster
                    memory
                    num_cpus
                    forced_dependencies
                    add_breakpoint
                    env_var
                    boilerplate
                    requirements
                    parallel_run
                    parallel_params
                    interval_file
                Component subsection
            1.3.6 More on the configuration file
                configuration file flags
                configuration file keywords
                configuration file reserved keywords
                Output directory customization
        1.4 Components
            1.4.1 Develop a component
                Component_main
                    focus method
                    make_cmd method
                    ComponentAbstract class
                Component_params
                Component_reqs
                Component_ui
                component_seed
            1.4.2 Examples
        1.5 Pipelines
            1.5.1 Create a new pipeline
                Examples
            1.5.2 Launch a pipeline
                1. Run the pipeline using run command
                    Input options of run command
                    On --qsub-options option
                    Initialize using run command
                    Run the tasks locally, on a cluster or in the cloud
                2. Run the pipeline using init command and the resulting pipeline script
                    Samples file
                    Setup file
                    Run the pipeline script generated by init command
                    What is the components directory?
                Results generated by a pipeline
                    What is the working directory?
                    What is the run ID?
                    What is the structure of the results directory generated by a pipeline?
                    How can I relaunch a pipeline?
        1.6 Guides
            1.6.1 Quick tutorial
                Make a component
                Make a pipeline
                Run a pipeline
                    Requirements
                    How to run the pipeline
                    Outputs
                More Examples
            1.6.2 Deploy Kronos to the cloud
                Setup StarCluster
                    Installing StarCluster
                    Creating an EBS volume
                    Launching your cloud cluster
                Setup Kronos
        1.7 License
        1.8 Contact
            1.8.1 Questions and feedback


Kronos is a highly flexible Python-based software tool that mainly enables bioinformatics developers, i.e. bioinformaticians who develop workflows for analyzing genomic data, to quickly make a workflow. It uses Ruffus as the underlying workflow management system and adds a level of abstraction on top of it, which significantly reduces programming overhead for workflow development and provides a mechanism to represent a workflow by a top-level YAML configuration file.

Each resulting workflow is portable, can run either locally or on a cluster, parallelizes tasks automatically, and logs all runtime events. The workflows are also highly modular and can be easily updated by editing their corresponding configuration files.

Kronos is free and open source under the MIT license and has a Docker image too. Although we have developed it with bioinformaticians in mind, Kronos can be used in any area where a workflow is required to run a series of tasks on a given set of input data.

    Note: Throughout this documentation we will use workflow and pipeline interchangeably.


  • CHAPTER 1

    Table of Contents

    1.1 Getting started

    1.1.1 Download

You can get the Kronos package from PyPI. You can also clone it from the GitHub repository.

    1.1.2 Dependencies

    You need to have the following dependencies installed:

Program/package   Version
python            2.7.*
ruffus            2.4.1
PyYaml            3.11
drmaa-python      0.7.6

where drmaa-python is optional and you will need to install it only if you want to use -b drmaa when running a workflow on a cluster.

    Info

PyYaml and ruffus are installed as dependencies when installing Kronos. So, you would only need to install drmaa-python.

    1.1.3 Install

    You can install Kronos using pip:

    pip install kronos-pipeliner

    or upgrade it by:

    pip install --upgrade kronos-pipeliner


    Tip

For a quick start, without having to go through all the details, you can refer directly to the Quick tutorial.

    1.2 Kronos package

    This section explains:

    • Kronos features

    • Kronos commands

    1.2.1 Kronos features

The Kronos package offers the following features, which eliminate the difficulties of making a pipeline:

    Info

    We define a pipeline (also called workflow) as a DAG composed of different tasks.

    • Single configuration file: the whole pipeline can be configured using a single configuration file.

    • Parallelization: parallelizable tasks are automatically run in parallel.

    • Synchronization: parallel tasks can be synchronized based on any of their parameters.

• Local, cluster and cloud support: the pipelines can be run locally, on a cluster of computing nodes, or in the cloud.

    • Forced dependencies: any task can be forced to wait for any other tasks.

• Breakpoints: a pipeline can be programmatically paused and restarted from any point in the pipeline.

• Boilerplates: an executable boilerplate or script can be injected into a task; it is run prior to running the task itself.

• Keywords: a set of specific keywords in the configuration file which will be automatically replaced by proper values at runtime.

    • Parameter sweep: a pipeline can be run for a list of different values for a set of input arguments.

• Output directory customization: the structure of the output directory where all the intermediate files and results are stored can be configured by the user in the configuration file.

    • Event logging: all the events are automatically logged.

    1.2.2 Kronos commands

Once Kronos is installed, it is added to the PATH, i.e. kronos becomes an available command which has the following sub-commands:


Command         Description
make_component  make a new component template
make_config     make a new configuration file
update_config   copy the fields of an old configuration file to a new configuration file
init            initialize a pipeline from the given configuration file
run             run Kronos-made pipelines with optional initialization

    as well as the following options:

Options              Description
-h or --help         print help - optional
-v or --version      show program's version number and exit - optional
-w or --working_dir  path/to/working_dir - optional

    Tip

The -w is optional and if not specified, the current working directory is used to save output files/directories. It is recommended to specify it to avoid overwriting existing files. See What is the working directory? for more information.

    make_component

This command creates a new component template. In other words, it automatically generates wrappers required for a seed to become a component.

    Info

    See Components for more information on seed and component.

    The command is used as follows:

kronos -w <working_dir> make_component <component_name>

For example, the following code creates a component template called my_comp in a directory called my_components_dir:

    kronos -w my_components_dir make_component my_comp

    make_config

    This command makes a new configuration file for the given list of component names.

    The command is used as follows:

kronos -w <working_dir> make_config <component_1> <component_2> ... -o <config_file_name>

For example, the following code creates a new configuration file called my_config_file.yaml for two components comp1 and comp2 in a directory called my_working_dir:

    kronos -w my_working_dir make_config comp1 comp2 -o my_config_file


Warning: It is required to export the path of the components directory to the PYTHONPATH environment variable prior to running the make_config command:

export PYTHONPATH=<path/to/components_dir>:$PYTHONPATH

    Tip

    Note that the suffix .yaml is automatically added to the end of the provided name for the configuration file.

    update_config

This command replaces the corresponding fields of an old configuration file with that of a new one. This is useful when there is a large configuration file which needs to be updated.

    The command is used as follows:

kronos -w <working_dir> update_config <old_config_file> <new_config_file> -o <updated_config_file_name>

For example, the following code creates a new configuration file called new_config_file.yaml by updating my_config_file1.yaml using my_config_file2.yaml in a directory called my_working_dir:

    kronos -w my_working_dir update_config my_config_file1.yaml my_config_file2.yaml -o new_config_file

    init

    This command initializes a new pipeline (i.e. creates a Python script) based on the input configuration file.

    Info

We also call the resulting Python script a pipeline script.

    The command is used as follows:

kronos -w <working_dir> init -y <config_file.yaml> -e <pipeline_name>

For example, the following code creates a Python script called my_pipeline.py for the input configuration file my_config_file.yaml in a directory called my_working_dir:

    kronos -w my_working_dir init -y my_config_file.yaml -e my_pipeline

The output Python script of this command can be run using the Kronos run command or directly as a Python script.

    Info

    See How to initialize a pipeline? for more information.

    Tip

    Note that the suffix .py is automatically added to the end of the provided name for the pipeline.


Warning: The init command might create the following directories in addition to the pipeline Python script:

  • intermediate_config_files

  • intermediate_pipeline_scripts

    These directories are used by Kronos and users should NOT modify them.

    run

    This command runs Kronos-made pipelines, i.e. pipeline scripts made by init command.

    The command is used as follows:

kronos run -k <pipeline_script.py> -c <components_dir> [options]

Warning: It is required to export the path of the components directory to the PYTHONPATH environment variable prior to running the run command:

export PYTHONPATH=<path/to/components_dir>:$PYTHONPATH

    Info

You can use the run command to initialize and run the pipeline using the configuration file directly (i.e. without the need to init first). See Run the pipeline using run command for more information.

    1.3 Configuration file

A configuration file generated by Kronos is a YAML file which describes a pipeline. It contains all the parameters of all the components in the pipeline as well as the information that builds the flow of the pipeline.

    A configuration file has the following major sections (shown in red ovals in the following figure):

    • __PIPELINE_INFO__

    • __GENERAL__

    • __SHARED__

    • __SAMPLES__

    • __TASK_i__

    where __TASK_i__ has the following subsections (shown in green ovals in the following figure):

    • reserved

    • run

    • component

    1.3.1 Pipeline_info section

    The __PIPELINE_INFO__ section stores information about the pipeline itself and looks like the following:


__PIPELINE_INFO__:
    name: null
    version: null
    author: null
    data_type: null
    input_type: null
    output_type: null
    host_cluster: null
    date_created: null
    date_last_updated: null
    kronos_version: '2.0.0'

    where:

    • name: a name for the pipeline

    • version: version of the pipeline

    • author: name of the developer of the pipeline

    • data_type: this can be used for database purposes

    • input_type: type of the input files to the pipeline

    • output_type: type of the output files of the pipeline

• host_cluster: the name of the cluster used to run the pipeline, or null if the pipeline is designed to run only locally

• date_created: date the pipeline was created

• date_last_updated: date the pipeline was last updated

• kronos_version: version of the kronos package that generated the configuration file; it is added automatically

    Info

All these fields are merely informative and do not have any impact on the flow of the pipeline.

    1.3.2 General section

The __GENERAL__ section contains key:value pairs derived automatically from the requirements field of the Component_reqs file of the components in the pipeline. Each key corresponds to a particular requirement, e.g. Python, java, etc., and each value is the path to where the key is. For instance, if there is a python: /usr/bin/python entry in the requirements of a component in the pipeline, then you would have the following in the __GENERAL__ section:

__GENERAL__:
    python: '/usr/bin/python'

Now, let's assume there is another Python installation on your machine in /path/my_python/bin/python and you prefer to use this instead. You can simply change the path to the desired one:

__GENERAL__:
    python: '/path/my_python/bin/python'


Warning: This will override the path of the python installation specified in the requirements of ALL the components, hence the name GENERAL. If you want to change the path for only one specific task, then you should use the requirements entry in the run subsection of that task. Note that the task's requirements entry takes precedence over the __GENERAL__ section.

    1.3.3 Shared section

In the __SHARED__ section you can define arbitrary key:value pairs and then use the keys as variables in the task sections. This helps you to parameterize the task sections. The mechanism that enables you to use variables is called a connection.

    Connections

A connection is simply a tuple, i.e. (x1, x2), where its first entry is always a section name, e.g. __SHARED__, and the second entry is a key in that section, e.g. ('__SHARED__', 'key1'), which means: 'use the value assigned to key1 in the __SHARED__ section'. For example, in the following configuration file, the value of the parameter reference of __TASK_1__ will be 'GRCh37-lite.fa' at runtime:

__SHARED__:
    ref: 'GRCh37-lite.fa'

__TASK_1__:
    component:
        input_files:
            reference: ('__SHARED__', 'ref')

    Tip

    A connection to the __SHARED__ section, i.e. its first entry is __SHARED__, is called a shared connection.

    Tip

It is recommended to use shared connections for the parameters in different tasks that expect the same value from users.
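To make the mechanism concrete, here is a minimal sketch, assuming connections are represented as Python tuples; resolve_shared is a hypothetical helper for illustration, not Kronos's actual implementation:

```python
def resolve_shared(config, task_name):
    """Replace ('__SHARED__', key) connection tuples in a task's
    input_files with the values defined in the __SHARED__ section."""
    shared = config.get('__SHARED__', {})
    inputs = dict(config[task_name]['component']['input_files'])
    for param, value in inputs.items():
        if isinstance(value, tuple) and value[0] == '__SHARED__':
            inputs[param] = shared[value[1]]  # follow the connection
    return inputs

config = {
    '__SHARED__': {'ref': 'GRCh37-lite.fa'},
    '__TASK_1__': {'component': {'input_files': {
        'reference': ('__SHARED__', 'ref'),
    }}},
}
print(resolve_shared(config, '__TASK_1__'))
# {'reference': 'GRCh37-lite.fa'}
```

Changing a value once in __SHARED__ then updates every task whose parameters connect to it.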

    IO connection

An IO connection is a connection whose first entry is a task name and whose second entry is a parameter of that task, e.g. ('__TASK_n__', 'param1') where param1 is a parameter in __TASK_n__. For instance, in the following configuration, ('__TASK_1__', 'out_file') is an IO connection which points to the out_file parameter of __TASK_1__. This connection means: 'use the value assigned to the out_file parameter of __TASK_1__ for the in_file parameter of __TASK_2__'. The value of the parameter in_file of __TASK_2__ will be 'some_file' at runtime.

__TASK_1__:
    component:
        out_file: 'some_file'

__TASK_2__:
    component:
        in_file: ('__TASK_1__', 'out_file')

    1.3.4 Samples section

The __SAMPLES__ section contains key:value pairs with a unique ID for each set of the pairs. It enables users to run the same pipeline for different sets of input arguments at once, i.e. users can perform a parameter sweep. kronos will run the pipeline for all the sets simultaneously, i.e. in parallel mode.

For example, for the following configuration file, kronos will make two intermediate pipelines and run them in parallel. In one of the intermediate pipelines the values of the tumour and normal parameters of __TASK_1__ are 'DAX1.bam' and 'DAXN1.bam', respectively, while in the other one they are 'DAX2.bam' and 'DAXN2.bam', respectively.

__SAMPLES__:
    ID1:
        tumour: 'DAX1.bam'
        normal: 'DAXN1.bam'
    ID2:
        tumour: 'DAX2.bam'
        normal: 'DAXN2.bam'

__TASK_1__:
    component:
        input_files:
            tumour: ('__SAMPLES__', 'tumour')
            normal: ('__SAMPLES__', 'normal')

    The ID of each set of input arguments, e.g. ID1 or ID2, is used by kronos to create intermediate pipelines.
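The expansion into per-sample intermediate pipelines can be sketched as follows; expand_samples is a hypothetical illustration of the behaviour, not Kronos's actual implementation:

```python
import copy

def expand_samples(config):
    """Create one intermediate config per sample ID, substituting
    ('__SAMPLES__', key) connections with that sample's values."""
    samples = config.get('__SAMPLES__', {})
    pipelines = {}
    for sample_id, values in samples.items():
        sub = copy.deepcopy(config)
        del sub['__SAMPLES__']
        for section, body in sub.items():
            inputs = body.get('component', {}).get('input_files', {})
            for param, conn in list(inputs.items()):
                if isinstance(conn, tuple) and conn[0] == '__SAMPLES__':
                    inputs[param] = values[conn[1]]  # sample connection
        pipelines[sample_id] = sub
    return pipelines

config = {
    '__SAMPLES__': {
        'ID1': {'tumour': 'DAX1.bam', 'normal': 'DAXN1.bam'},
        'ID2': {'tumour': 'DAX2.bam', 'normal': 'DAXN2.bam'},
    },
    '__TASK_1__': {'component': {'input_files': {
        'tumour': ('__SAMPLES__', 'tumour'),
        'normal': ('__SAMPLES__', 'normal'),
    }}},
}
pipes = expand_samples(config)
print(pipes['ID1']['__TASK_1__']['component']['input_files']['tumour'])
# DAX1.bam
```

Each resulting sub-config corresponds to one intermediate pipeline that kronos runs in parallel with the others.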

    Warning: Each ID in the __SAMPLES__ section must be unique, otherwise their corresponding results will beoverwritten.

Warning: kronos creates the following directories in the working directory to store the intermediate pipelines:

  • intermediate_config_files

  • intermediate_pipeline_scripts

    Users should NOT modify them.

    Tip

    A connection to the __SAMPLES__ section, i.e. its first entry is __SAMPLES__, is called a sample connection.

    The differences between __SAMPLES__ and __SHARED__ sections are:

    • a unique ID is required in the __SAMPLES__ section for each set

• a separate individual pipeline is generated for each set of key:value pairs, i.e. for each ID, in the __SAMPLES__ section

    Tip

The number of simultaneous parallel pipelines can be set by the user when running the pipeline using the input option -n.


    1.3.5 Task section

Each task section in a configuration file corresponds to a component. The name of a task section follows the convention __TASK_i__ where i is a number used to make the name unique, e.g. __TASK_1__ or __TASK_27__. If a task is run in parallel, then there will be sections with names __TASK_i_j__ which refer to the children of task __TASK_i__, e.g. __TASK_1_1__, __TASK_1_2__, etc. Each task section has the following subsections:

    • reserved

    • run

    • component
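As an illustrative aside (not part of Kronos itself), the naming convention above can be checked with a small parser; TASK_RE and parse_task_name are hypothetical helpers:

```python
import re

# Parent tasks look like __TASK_i__, children like __TASK_i_j__.
TASK_RE = re.compile(r'^__TASK_(\d+)(?:_(\d+))?__$')

def parse_task_name(name):
    """Return (parent_number, child_number or None) for a task section name."""
    m = TASK_RE.match(name)
    if not m:
        raise ValueError(f'not a task section: {name}')
    parent, child = m.groups()
    return int(parent), (int(child) if child else None)

print(parse_task_name('__TASK_27__'))   # (27, None)
print(parse_task_name('__TASK_1_2__'))  # (1, 2)
```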

    Reserved subsection

    This subsection contains information about the component of the task:

reserved:
    # do not change this section
    component_name: 'name_of_component'
    component_version: 'version_of_component'
    seed_version: 'version_of_seed'

Warning: The information in this subsection is automatically specified by kronos and should NOT be altered by users.

    Run subsection

This subsection is used to instruct kronos how to run the task. It looks like the following example:

run:
    use_cluster: False
    memory: '5G'
    num_cpus: 1
    forced_dependencies: []
    add_breakpoint: False
    env_vars:
    boilerplate:
    requirements:
    parallel_run: False
    parallel_params: []
    interval_file:

    use_cluster

You can determine whether each task in a pipeline should be run locally or on a cluster using the boolean flag use_cluster. Therefore, in a single pipeline some tasks might be run locally while the others are submitted to a cluster.

Warning: If use_cluster: True, then the pipeline should be run on a grid computing cluster. Otherwise, you will see the error message failed to load ClusterJobManager and the pipeline will eventually fail.


Warning: If use_cluster: True, make sure you pass the correct path for the drmaa library specified by the -d option (see Input options of run command for more information on input options). The default value for the -d option is $SGE_ROOT/lib/lx24-amd64/libdrmaa.so, where the SGE_ROOT environment variable is automatically added to the path, so you only need to specify the rest of the path if it is different from the default value.

    memory

If you submit a task to a cluster, i.e. use_cluster: True, then memory specifies the maximum amount of memory requested by the task.

    num_cpus

If you submit a task to a cluster, i.e. use_cluster: True, then num_cpus specifies the number of cores requested by the task.

    forced_dependencies

You can force a task to wait for some other tasks to finish by simply passing the list of their names to the attribute forced_dependencies of the task. For example, in the following config __TASK_1__ is forced to wait for __TASK_n__ and __TASK_m__ to finish running first.

__TASK_1__:
    run:
        forced_dependencies: [__TASK_n__, __TASK_m__]

    Tip

    forced_dependencies always expects a list, e.g. [], [__TASK_n__], [__TASK_n__, __TASK_m__].

    Info

A dependency B for task A means that task A must wait for task B to finish before task A starts to run.

    Info

    If there is an IO connection between two tasks, then an implicit dependency is inferred by kronos.
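Putting the two dependency sources together, here is a hypothetical sketch (task_dependencies is illustrative, not Kronos's API) of how forced and implicit IO-connection dependencies combine for a task:

```python
def task_dependencies(config, task_name):
    """Union of a task's forced dependencies and the implicit
    dependencies inferred from its IO connections."""
    task = config[task_name]
    deps = set(task.get('run', {}).get('forced_dependencies', []))
    for value in task.get('component', {}).values():
        if isinstance(value, tuple) and value[0].startswith('__TASK_'):
            deps.add(value[0])  # IO connection implies a dependency
    return deps

config = {
    '__TASK_1__': {'component': {'out_file': 'some_file'}},
    '__TASK_2__': {
        'run': {'forced_dependencies': ['__TASK_3__']},
        'component': {'in_file': ('__TASK_1__', 'out_file')},
    },
    '__TASK_3__': {'component': {}},
}
print(sorted(task_dependencies(config, '__TASK_2__')))
# ['__TASK_1__', '__TASK_3__']
```

__TASK_2__ waits for __TASK_1__ because of the IO connection and for __TASK_3__ because it was forced, even though no data flows between them.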

    add_breakpoint

A breakpoint forces a pipeline to pause. If add_breakpoint: True for a task, the pipeline will stop running after that task is done. Once the pipeline is relaunched, it will resume running from where it left off. This mechanism has a number of applications:

• if a part of a pipeline needs the user's supervision, for example to visually inspect some output data, then adding a breakpoint can pause the pipeline for the user to make sure everything is as desired and then relaunch from that point.


• you can run a part of a pipeline several times, for example to fine-tune some of the input arguments. This can be done by adding breakpoints to the start and end tasks of that part of the pipeline and relaunching the pipeline every time.

• you can run different parts of a single pipeline on different machines or clusters provided that the pipeline can access the files generated by the previous runs. For instance, you can run a pipeline locally up to some point (a breakpoint) and then relaunch the pipeline on a different machine or cluster to finish the rest of the tasks.

    Tip

If a task is parallelized and it has add_breakpoint: True, then the pipeline waits for all the children of the task to finish running and then applies the breakpoint.

    Note: When a breakpoint happens, all the running tasks are aborted.

    env_var

You can specify a list of the environment variables required for a task to run successfully directly in the configuration file. It looks like the following:

__TASK_n__:
    run:
        env_vars:
            var1: value1
            var2: value2

    Tip

    If an environment variable accepts a list of values, you can pass a list to that environment variable. For example:

env_vars:
    var1: [value1, value2, ...]
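As a sketch of what this mapping amounts to (not Kronos's actual implementation), the env_vars entries can be thought of as shell export statements; env_var_exports is a hypothetical helper, and joining list values with ':' is an assumption about how multi-value variables combine:

```python
def env_var_exports(env_vars):
    """Render an env_vars mapping as shell export statements.
    List values are joined with ':' (an assumption for illustration)."""
    lines = []
    for name, value in env_vars.items():
        if isinstance(value, list):
            value = ':'.join(str(v) for v in value)
        lines.append(f'export {name}={value}')
    return lines

print(env_var_exports({'var1': ['value1', 'value2'], 'var2': 'value3'}))
# ['export var1=value1:value2', 'export var2=value3']
```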

    boilerplate

Using this attribute you can insert a command or a script (in general, a boilerplate) directly into a task. The boilerplate is run prior to running the task. For example, assume you need to set up your Python path using the module load command. You can either pass the command as follows:

__TASK_n__:
    run:
        boilerplate: 'module load python/2.7.6'

or save it in a file, e.g. called setup_file:

module load python/2.7.6

and pass the path to the file, e.g. /path/to/setup_file, to the boilerplate attribute:

__TASK_n__:
    run:
        boilerplate: /path/to/setup_file


    requirements

Similar to the __GENERAL__ section, this entry contains a list of key:value pairs derived automatically from the requirements field of the Component_reqs file of the component. The difference is that this list contains only the requirements for this task and applies only to this task and not the rest of the tasks in the pipeline.

    It looks like the following:

__TASK_n__:
    run:
        requirements:
            req1: value1
            req2: value2

    Tip

This entry takes precedence over the __GENERAL__ section. If you want to get the values for the requirements from the __GENERAL__ section, then simply leave the value for each requirement in this entry blank or pass null. For example:

requirements:
    req1:
    req2:
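The precedence rule can be sketched as follows; resolve_requirement is a hypothetical helper illustrating the fallback behaviour described above, not Kronos's actual code:

```python
def resolve_requirement(config, task_name, req):
    """A task-level requirement value wins; a blank/null value
    falls back to the __GENERAL__ section."""
    task_reqs = config[task_name].get('run', {}).get('requirements', {})
    value = task_reqs.get(req)
    if value:  # non-empty task-level value takes precedence
        return value
    return config.get('__GENERAL__', {}).get(req)

config = {
    '__GENERAL__': {'python': '/usr/bin/python'},
    '__TASK_1__': {'run': {'requirements': {'python': None}}},
    '__TASK_2__': {'run': {'requirements': {'python': '/path/my_python/bin/python'}}},
}
print(resolve_requirement(config, '__TASK_1__', 'python'))
# /usr/bin/python
print(resolve_requirement(config, '__TASK_2__', 'python'))
# /path/my_python/bin/python
```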

    parallel_run

It is a boolean flag that specifies whether or not to run a task in parallel. If parallel_run: True, the task is automatically expanded to a number of children tasks that are run in parallel simultaneously.

    Warning: A task needs to be parallelizable to run in parallel.

    Tip

If a task is not parallelizable, the attributes parallel_run, parallel_params and interval_file will NOT be shown in the run subsection, and the following message is shown in the configuration file under the run subsection of the task: NOTE: component cannot run in parallel mode. Otherwise, it is considered parallelizable.

There are two mechanisms by which a task can be parallelized:

parallelization

In this mechanism, the task is expanded to its children, where the number of its children is determined by one of the following:

    • number of lines in the interval_file

    • number of chromosomes, if there is no interval file specified

    Tip

kronos uses the set [1, 2, ..., 22, X, Y] for chromosome names and it parallelizes a task based on this set by default if no interval file is specified.


    synchronization

    If a task has:

    • an IO connection to a second task

    • and its parallel_params attribute is also set,

    then kronos expands the first task as many times as the number of children of the second task, if the two tasks are synchronizable.

    Tip

    Two tasks are synchronizable if:

    1. both are parallelizable, and
    2. when they are both parallelized, they have the same number of children.

    Note: If any of the conditions mentioned above does not hold, then kronos automatically merges the results from the predecessor task and passes the result to the next task.

    Tip

    If task A is synchronizable with both tasks B and C individually but not simultaneously, then kronos synchronizes task A with one of them and uses the merge mechanism for the other one.

    parallel_params

    This attribute controls:

    • whether to synchronize a task with its predecessor(s)

    • over what parameters the synchronization should happen

    It accepts a list of parameters of the task that have IO connections to the predecessors. For instance, suppose task __TASK_n__ has task __TASK_m__ as its predecessor and has two IO connections with it, e.g. in_param1: ('__TASK_m__', 'out_param1') and in_param2: ('__TASK_m__', 'out_param2'). Assuming that the two tasks are synchronizable, parallel_params = ['in_param1'] forces kronos to synchronize task __TASK_n__ with task __TASK_m__ over the parameter in_param1. In other words, task __TASK_n__ is expanded as many times as the number of children of task __TASK_m__, and each of its children gets its value for in_param1 from the out_param1 of one of the children of task __TASK_m__.
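The scenario described above could be sketched in the configuration file as follows (task and parameter names are illustrative):

```yaml
__TASK_n__:
    run:
        # synchronize with __TASK_m__ over in_param1 only
        parallel_params: ['in_param1']
    component:
        input_files:
            in_param1: ('__TASK_m__', 'out_param1')
            in_param2: ('__TASK_m__', 'out_param2')
```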

    interval_file

    An interval file contains a list of intervals or chunks which a task will use as input arguments for its children. For example, if an interval file looks like:

    chunk1
    chunk2
    chunk3

    then each line, i.e. chunk1, chunk2, chunk3, will be passed separately to a child as an input argument. The path to the interval file is passed to the interval_file attribute.
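Putting this together, a hypothetical task parallelized over an interval file might be configured as:

```yaml
__TASK_1__:
    run:
        parallel_run: True
        # one child task is created per line of this file
        interval_file: /path/to/my_intervals.txt
```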


    Warning: If you want to use the interval file functionality in a task, the component of that task should support it. In other words, it should have the focus method in its component_main module. This method determines how and to which parameter a chunk should be passed.

    Component subsection

    This subsection contains all the input parameters of the component of the task. The parameters are categorized into three subsections:

    • input_files: lists all the input files and directories

    • output_files: lists all the output files and directories

    • parameters: lists all the other parameters

    1.3.6 More on the configuration file

    configuration file flags

    kronos uses the following flags assigned to various parameters of different tasks:

    • __REQUIRED__: means that the user MUST specify a value for that parameter.

    • __FLAG__: means that the parameter is a boolean flag. Users can assign True or False values to the parameter. The default value is False.

    Tip

    The default values for the parameters appear in the configuration file. If there is no default value, then either one of the configuration file flags will be used or it is left blank.

    Note: Put quotation marks around string values, for example 'GRCh37.66'. Unquoted strings, while accepted by YAML, can result in unexpected behaviour.
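For example, a parameters subsection combining these flags, a default value and a quoted string might look like this (the parameter names are illustrative):

```yaml
parameters:
    genome_version: 'GRCh37.66'   # quoted string value
    deep: '__FLAG__'              # boolean flag; defaults to False
    coverage: '__REQUIRED__'      # the user MUST fill this in
```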

    configuration file keywords

    You can use the following keywords in the configuration file, which will be automatically replaced by proper values at runtime:

    Keyword                  Description
    $pipeline_name           the name of the pipeline
    $pipeline_working_dir    the path to the working directory
    $run_id                  run ID
    $sample_id               the ID used in the samples section

    Warning: The character $ is part of the keyword and MUST be used.
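For instance (the parameter name and path layout are illustrative), the keywords can be embedded directly in a value:

```yaml
output_files:
    out: $pipeline_working_dir/results/$sample_id.vcf
```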


    configuration file reserved keywords

    The following words are reserved for the kronos package:

    • reserved

    • run

    • component

    Warning: The reserved keywords can NOT be used as the name of parameters of components/tasks.

    Output directory customization

    kronos supports paths in the output_files subsection of the component subsection. In other words, users can specify paths like /dir1/dir2/dir3/my.file for the parameters of the output_files subsection, and all the directories in the path will be automatically made if they do not exist. For example, kronos will make directories dir1, dir2 and dir3 with the given hierarchy. This mechanism enables developers to make any directory structure as desired. Basically, they can organize the outputs directory of their pipeline directly from within the configuration file. For instance, assume a pipeline has two tasks with components comp1 and comp2. The user can categorize the outputs of these tasks by the names of their corresponding components as follows (note the values assigned to the out and log parameters of each component):

    __TASK_i__:
        component:
            output_files:
                out: comp1/res/my_res_name.file
                log: comp1/log/my_log_name.log

    __TASK_ii__:
        component:
            output_files:
                out: comp2/res/my_res_name.file
                log: comp2/log/my_log_name.log

    so, the following tree is made inside the outputs directory given the above configuration file:

    outputs
    |____comp1
    |    |____log
    |    |    |____TASK_i_my_log_name.log
    |    |____res
    |         |____TASK_i_my_res_name.file
    |____comp2
         |____log
         |    |____TASK_ii_my_log_name.log
         |____res
              |____TASK_ii_my_res_name.file

    Tip

    Output filenames are always prefixed with the task names to prevent overwriting, e.g. TASK_i and TASK_ii in the above example.


    Tip

    If you want to specify a directory name for a parameter, you can do so by using the / character at the end of the directory name. This instructs kronos to make the directory in the outputs directory, or any other specified path, if the directory does not exist.
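A hypothetical output parameter naming a directory rather than a file could therefore be written as:

```yaml
output_files:
    # the trailing / tells kronos to create this directory if needed
    working_space: my_intermediate_files/
```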

    1.4 Components

    This part explains how to make new components. You need to know the following definitions first:

    seed: a seed is a computer application, a program or, in general, a command line tool that performs a specific task. This can be a simple bash copy command or a software suite like MutationSeq or Strelka.

    component: a component is a wrapper around a seed that makes the seed compatible with kronos so that the seed can then be used as part of a pipeline. In other words, components are the building blocks of the pipelines generated by kronos.

    1.4.1 Develop a component

    The purpose of components is to modularize workflows with reusable building blocks that require minimal development. The number of lines of code needed to make a new component is very small. The simple development instructions eliminate, for example, the need to use Ruffus decorators, input/output management using regular expressions and complicated dependency management in the code, which can easily become very complex as the number of tasks in a workflow grows. Furthermore, a large workflow can be divided into a set of small components, resulting in much faster and more manageable workflow development.

    All command line tools can be used as seeds and therefore wrapped as Kronos components. Regardless of how complicated they are, their corresponding components have a standard directory structure composed of specific wrappers and sub-directories. The wrappers are agnostic to the programming language used for developing the seed. The components should be developed prior to making the workflow. However, since they are individually and independently developed, and due to their reusability, the development of a component happens only once and the component can then be used in various pipelines.

    In order to develop a component you need to:

    1. create a directory with a name that is the same as the name you want for the component, e.g. my_comp.

    2. create a directory called "component_seed" inside the my_comp directory and copy the seed source code into it.

    3. create the following files inside the my_comp directory:

    • __init__.py: an empty file.

    • component_main.py: the main python script that contains the Component class.

    • component_params.py: contains all the information about input/output parameters of the component.

    • component_reqs.py: contains all the information about the requirements of the component.

    • component_ui.py: an argparse UI for the component.


    Tip

    All the above files and directories are generated by the Kronos make_component command. The user only needs to customize them.

    Tip

    The seed source code is not required to be copied into the component_seed directory. Instead, the seed can be used as a requirement for the component, which can be listed in the component_reqs.py.

    The component directory tree looks like the following:

    |-- <component_name>
    |   |-- __init__.py
    |   |-- component_main.py
    |   |-- component_params.py
    |   |-- component_reqs.py
    |   |-- component_ui.py
    |   |-- component_seed

    Note: <component_name> should be replaced with the actual name of the component, e.g. my_comp. The rest of the file and directory names should be exactly as shown above.

    Tip

    It is recommended to add the following files and directories (not generated automatically) as well:

    • component_test: a directory where all the test files exist.
    • README: a readme file to provide more information about the component.

    Component_main

    The core of a component is the component_main.py python script. This module defines the Component class which extends the ComponentAbstract class.

    Using the make_component command, the following component_main.py file is generated:

    """component_main.pyThis module contains Component class which extendsthe ComponentAbstract class. It is the core of a component.

    Note the places you need to change to make it work for you.They are marked with keyword 'TODO'."""

    from kronos.utils import ComponentAbstractimport os

    class Component(ComponentAbstract):

    20 Chapter 1. Table of Contents

  • kronos Documentation, Release 2.2.0

    """TODO: add component doc here."""

    def __init__(self, component_name="my_comp",component_parent_dir=None, seed_dir=None):

    ## TODO: pass the version of the component here.self.version = "v0.99.0"

    ## initialize ComponentAbstractsuper(Component, self).__init__(component_name,

    component_parent_dir, seed_dir)

    ## TODO: write the focus method if the component is parallelizable.## Note that it should return cmd, cmd_args.def focus(self, cmd, cmd_args, chunk):

    pass# return cmd, cmd_args

    ## TODO: this method should make the command and command arguments## used to run the component_seed via the command line. Note that## it should return cmd, cmd_args.def make_cmd(self, chunk=None):

    ## TODO: replace 'comp_req' with the actual component## requirement, e.g. 'python', 'java', etc.cmd = self.requirements['comp_req']

    cmd_args = []

    args = vars(self.args)

    ## TODO: fill the following component params to seed params dictionary## if the name of parameters of the seed are different than## component parameter names.comp_seed_map = {

    #e.g. 'component_param1': 'seedParam1',#e.g. 'component_param2': 'seedParam2',}

    for k, v in args.items():if v is None or v is False:

    continue

    ## TODO: uncomment the next line if you are using## comp_seed_map dictionary.# k = comp_seed_map[k]

    cmd_args.append('--' + k)

    if isinstance(v, bool):continue

    if isinstance(v, str):v = repr(v)

    if isinstance(v, (list, tuple)):cmd_args.extend(v)

    else:cmd_args.extend([v])

    1.4. Components 21

  • kronos Documentation, Release 2.2.0

    if chunk is not None:cmd, cmd_args = self.focus(cmd, cmd_args, chunk)

    return cmd, cmd_args

    ## To run as stand alonedef _main():

    c = Component()c.args = component_ui.argsc.run()

    if __name__ == '__main__':import component_ui_main()

    Note: The places you need to change in the generated file to make it work for you are marked with the keyword 'TODO'.

    There are two methods in this file that you need to customize:

    • focus

    • make_cmd

    focus method

    Each parallelizable component requires a focus method. The purpose of this method is to tell the component to process only one chunk of the input data rather than the entire file. How this is done varies depending on the component, but it basically adds to, or alters, the component command to this end. For example, in the following implementation, the focus method simply passes the chunk to the --interval option in the command arguments cmd_args (most of the time, this implementation does the job):

    def focus(self, cmd, cmd_args, chunk):
        cmd_args.append('--interval ' + chunk)
        return cmd, cmd_args

    Note: You need to implement focus method only if the component is parallelizable.
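As a standalone sketch of this pattern (the seed command, option name and chunk value are illustrative, not a fixed kronos API), the effect of focus on the command arguments can be seen directly:

```python
# Minimal model of the focus pattern: append the chunk as an option value.
def focus(cmd, cmd_args, chunk):
    # forward the current chunk to the seed via its --interval option
    cmd_args.append('--interval ' + chunk)
    return cmd, cmd_args

cmd, cmd_args = focus('python my_seed.py', ['--foo', 'data1'], 'chunk1')
print(cmd_args)
```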

    make_cmd method

    All components should implement this method in their component_main.py. This method essentially returns the command string that one can use to run the seed on a command line. For example, if the seed can be run using the following command:

    python my_seed_command.py --foo data1 --bar data2

    then the make_cmd method would look like this (note that we only need to change the first two lines of the default file made by kronos):

    def make_cmd(self, chunk):
        path = os.path.join(self.seed_dir, 'my_seed_command.py')
        cmd = self.requirements['python'] + ' ' + path
        cmd_args = []
        args = vars(self.args)

        ## TODO: fill the following component params to seed params dictionary
        ## if the name of parameters of the seed are different than
        ## component parameter names.
        comp_seed_map = {
            # e.g. 'component_param1': 'seedParam1',
            # e.g. 'component_param2': 'seedParam2',
            }

        for k, v in args.items():
            if v is None or v is False:
                continue

            ## TODO: uncomment the next line if you are using
            ## comp_seed_map dictionary.
            # k = comp_seed_map[k]

            cmd_args.append('--' + k)

            if isinstance(v, bool):
                continue
            if isinstance(v, str):
                v = repr(v)
            if isinstance(v, (list, tuple)):
                cmd_args.extend(v)
            else:
                cmd_args.extend([v])

        if chunk is not None:
            cmd, cmd_args = self.focus(cmd, cmd_args, chunk)

        return cmd, cmd_args

    Tip

    In the above example, python is a requirement for the component and should be added to the component_reqs.py of the component. Also, the parameters foo and bar should be added to the component_params.py.
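To see what the argument-building loop in make_cmd produces, here is a minimal standalone model (the function name and sample arguments are illustrative; the real loop lives inside the Component class):

```python
# Mimic the loop in make_cmd: skip unset values, keep flags bare,
# quote strings with repr, and flatten lists/tuples.
def build_cmd_args(args):
    cmd_args = []
    for k, v in args.items():
        if v is None or v is False:
            continue                  # drop unset options and False flags
        cmd_args.append('--' + k)
        if isinstance(v, bool):
            continue                  # True flags take no value
        if isinstance(v, str):
            v = repr(v)               # quote string values
        if isinstance(v, (list, tuple)):
            cmd_args.extend(v)
        else:
            cmd_args.extend([v])
    return cmd_args

print(build_cmd_args({'foo': 'data1', 'threads': 2, 'deep': True, 'interval': None}))
```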

    ComponentAbstract class

    This class comprises the following attributes and methods:

    Attributes:


    Attribute          Description
    args               the argparse namespace containing all the input arguments from the Component_ui module
    components_dir     path to the directory where the component exists
    component_name     name of the component - specific
    component_params   Component_params module of the component
    component_reqs     Component_reqs module of the component
    env_vars           see Component_reqs
    memory             see Component_reqs
    parallel           see Component_reqs
    requirements       see Component_reqs
    seed_dir           path to the directory where the seed exists; most of the time it is the component_seed directory inside the component directory
    seed_version       version of the seed
    version            version of the component - specific

    Tip

    specific means it should be assigned a value when implementing a component.

    Methods:

    Method     Description
    __init__   initialize general attributes that each component must have
    run        run the component command locally
    focus      update the command and command arguments for each chunk - virtual
    make_cmd   make the command used to run the seed of the component; this returns the same command that one would use to run the component as a stand-alone program via the command line - virtual
    test       run the unittest of the component - virtual

    Tip

    The class can be imported from the utils module of the kronos package:

    from kronos.utils import ComponentAbstract

    Component_params

    This is a python module and contains the following information:

    • input_files: a dictionary with keys being the input file parameters and values being the default values or proper flags based on the component UI. For example:

    input_files = {'samples': ['tumour:__REQUIRED__',
                               'normal:__REQUIRED__',
                               'reference:__REQUIRED__',
                               'model:__REQUIRED__'],
                   'config': 'some_default.cfg',
                   'positions_file': None}


    Note: This dictionary includes only parameters that expect input files or directories.

    • output_files: a dictionary with keys being the output file parameters and values being the default values or proper flags based on the component UI. For example:

    output_files = {'export_features': None,
                    'log_file': 'mutationSeq_run.log',
                    'out': None}

    Note: This dictionary includes only parameters that expect output files or directories.

    • input_params: a dictionary with keys being the input non-file parameters and values being the default values or proper flags based on the component UI. For example:

    input_params = {'all': '__FLAG__',
                    'buffer_size': '2G',
                    'coverage': 4,
                    'deep': '__FLAG__',
                    'interval': None,
                    'no_filter': '__FLAG__'}

    Note: All other parameters that are not included in input_files and output_files should be listed in input_params.

    Using the make_component command, the following component_params.py file is generated:

    """component_params.py

    Note the places you need to change to make it work for you.They are marked with keyword 'TODO'."""

    ## TODO: here goes the list of the input files. Use flags:## '__REQUIRED__' to make it required## '__FLAG__' to make it a flag or switch.input_files = {# 'input_file1' : '__REQUIRED__',# 'input_file2' : None

    }

    ## TODO: here goes the list of the output files.output_files = {# 'output_file1' : '__REQUIRED__',# 'output_file1' : None

    }

    ## TODO: here goes the list of the input parameters excluding input/output files.input_params = {# 'input_param1' : '__REQUIRED__',# 'input_param2' : '__FLAG__',# 'input_param3' : None

    }

    1.4. Components 25

  • kronos Documentation, Release 2.2.0

    ## TODO: here goes the return value of the component_seed.## DO NOT USE, Not implemented yet!return_value = []

    Example: This is an example showing the content of a component_params.py file:

    input_files = {'tumour': '__REQUIRED__',
                   'normal': '__REQUIRED__',
                   'reference': '__REQUIRED__',
                   'model': '__REQUIRED__',
                   'config': 'metadata.config',
                   'positions_file': None}

    output_files = {'export_features': None,
                    'log_file': 'mutationSeq_run.log',
                    'out': '__REQUIRED__'}

    input_params = {'all': '__FLAG__',
                    'buffer_size': '2G',
                    'coverage': 4,
                    'deep': '__FLAG__',
                    'interval': None,
                    'no_filter': '__FLAG__',
                    'normalized': '__FLAG__',
                    'normal_variant': 25,
                    'purity': 70,
                    'mapq_threshold': 20,
                    'baseq_threshold': 10,
                    'indl_threshold': 0.05,
                    'manifest': '__OPTIONAL__',
                    'single': '__FLAG__',
                    'threshold': 0.5,
                    'tumour_variant': 2,
                    'features_only': '__FLAG__',
                    'verbose': '__FLAG__',
                    'titan_mode': '__FLAG__'}

    Component_reqs

    This is a python module and contains the following information:

    • env_vars: a dictionary with keys being the names of environment variables and values being the paths/content to export. The values can be updated in the configuration file using env_var in the run subsection. Therefore, it is recommended not to include the paths as values in this file and instead use an empty list, [], or None as a value.

    • memory: specifies the minimum memory required by the component to properly run on a cluster. The format isnG, e.g. 30G.

    • parallel: a boolean flag that specifies whether or not a component can run in parallel mode.

    • requirements: a dictionary with keys usually being the name of a program/software and values being None or the flag __REQUIRED__. The values will later be updated by kronos using the content of the __GENERAL__ section.

    • seed_version: the version of the seed.


    • version: the version of the component.

    Using the make_component command, the following component_reqs.py file is generated:

    """component_reqs.py

    Note the places you need to change to make it work for you.They are marked with keyword 'TODO'."""

    ## TODO: here goes the list of the environment variables, if any,## required to export for the component to function properly.env_vars = {# 'env_var1' : ['value1', 'value2'],# 'env_var2' : 'value3'

    }

    ## TODO: here goes the max amount of the memory required.memory = '5G'

    ## TODO: set this to True if the component is parallelizable.parallel = False

    ## TODO: here goes the list of the required software/apps## called by the component.requirements = {# 'python': '__REQUIRED__',

    }

    ## TODO: here goes the version of the component seed.seed_version = '0.99.0'

    ## TODO: here goes the version of the component itself.version = '0.99.0'

    Example: This is an example showing the content of a component_reqs.py:

    env_vars = {'LD_LIBRARY_PATH': []}

    memory = '4G'

    parallel = True

    requirements = {'java': '__REQUIRED__'}

    seed_version = 'version 3.2'

    version = 'v1.0.1'

    Component_ui

    It is a python module that contains an argparse UI for the component. Using the make_component command, the following component_ui.py file is generated:

    """component_ui.py

    Note the places you need to change to make it work for you.

    1.4. Components 27

    https://docs.python.org/3/library/argparse.html

  • kronos Documentation, Release 2.2.0

    They are marked with keyword 'TODO'."""

    import argparse

    #==============================================================================
    # make a UI
    #==============================================================================
    ## TODO: pass the name of the component to the 'prog' parameter and a
    ## brief description of your component to the 'description' parameter.
    parser = argparse.ArgumentParser(prog='my_comp',
                                     description="""brief description of
                                     your component goes here.""")

    ## TODO: create the list of input options here. Add as many as desired.
    parser.add_argument(
        "-x", "--xparam",
        default=None,
        help="""help message goes here.""")

    ## parse the argument parser.
    args, unknown = parser.parse_known_args()

    Example: This is an example showing the content of a component_ui.py:

    import sys
    import argparse

    #==============================================================================
    # make a UI
    #==============================================================================
    parser = argparse.ArgumentParser(prog='snpeff',
                                     description='''Genetic variant annotation and effect
                                     prediction toolbox. It annotates and predicts the
                                     effects of variants on genes (such as amino acid changes)''',
                                     epilog='''Input file: Default is STDIN''')

    # required arguments
    required_arguments = parser.add_argument_group("Required arguments")

    required_arguments.add_argument("--out",
                                    default=None,
                                    required=True,
                                    help='''specify the path/to/out.vcf to save output to a file''')

    # mandatory / positional arguments
    required_arguments.add_argument("genome_version",
                                    choices=['GRCh37.66'],
                                    help='''genomic build version''')

    required_arguments.add_argument("variants_file",
                                    help='''file containing variants''')

    # optional options
    optional_options = parser.add_argument_group("Options")

    optional_options.add_argument("-a", "--around",
                                  default=False, action="store_true",
                                  help='''Show N codons and amino acids around change
                                  (only in coding regions). Default is 0 codons.''')

    args, unknown = parser.parse_known_args()

    Warning: It is required to use parse_known_args instead of parse_args.

    component_seed

    This is a directory within the component directory where all the source code of the actual program resides.

    1.4.2 Examples

    Please refer to our GitHub repositories for more examples. The repositories with the postfix _workflow are the pipelines and the rest are the components.

    1.5 Pipelines

    1.5.1 Create a new pipeline

    To create a new pipeline using Kronos, you simply need to follow two steps:

    1. generate a configuration file using make_config command.

    2. configure the pipeline by customizing the resulting configuration file, i.e. pass proper values to the attributes in the run subsection and set the connections.

    Note: If you need a component that does not already exist, then you need to make the component first.

    Examples

    Please refer to our GitHub repositories for more examples. The repositories with the postfix _workflow are the pipelines and the rest are the components.

    1.5.2 Launch a pipeline

    Essentially, you have two options to launch the pipelines generated by Kronos:

    1. (Recommended) use run command to initialize and run in one step.

    2. use init command to initialize the pipeline first, then run the resulting Python script.

    Note: Make sure the version of the Kronos package installed on your machine is compatible with the version used to generate the configuration file, which is shown at the top of the configuration file in the __PIPELINE_INFO__ section.


    1. Run the pipeline using run command

    It is very easy to run a pipeline using run command:

    kronos run -k <pipeline_script> -c <components_dir> [options]

    Input options of run command

    This is the list of all the input options you can use with run command:


    Option                       Default                             Description
    -h or --help                 False                               print help - optional
    -b or --job_scheduler        drmaa                               job scheduler used to manage jobs on the cluster - optional
    -c or --components_dir       None                                path to components_dir - required
    -d or --drmaa_library_path   lib/lx24-amd64/libdrmaa.so          path to drmaa_library - optional
    -e or --pipeline_name        None                                pipeline name - optional
    -i or --input_samples        None                                path to the input samples file - optional
    -j or --num_jobs             1                                   maximum number of simultaneous jobs per pipeline - optional
    -k or --kronos_pipeline      None                                path to Kronos-made pipeline script - optional
    -n or --num_pipelines        1                                   maximum number of simultaneous running pipelines - optional
    -p or --python_installation  python                              path to python executable - optional
    -q or --qsub_options         None                                native qsub specifications for the cluster in a single string - optional
    -r or --run_id               None (current timestamp is used)    pipeline run id - optional
    -s or --setup_file           None                                path to the setup file - optional
    -w or --working_dir          current working directory           path to the working directory - optional
    -y or --config_file          None                                path to the config_file.yaml - optional
    --no_prefix                  False                               switch off the prefix that is added to all the output files by Kronos - optional

    Note: "-c or --components_dir" must be specified.

    On the --qsub_options option

    There are a few keywords that can be used with the --qsub_options option. These keywords are replaced with corresponding values from the run subsection of each task when the job for that task is submitted:

    • mem: will be replaced with memory from run subsection

    • h_vmem: will be replaced with 1.2 * memory.

    • num_cpus: will be replaced with num_cpus from run subsection

    For example:

    --qsub_options " -pe ncpus {num_cpus} -l mem_free={mem} -l mem_token={mem} -l h_vmem={h_vmem} [other options]"

    Note: If you specify the --qsub_options option with hard values (i.e. not using these keywords), they will overwrite the values in the run subsection.
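A rough model of how these keywords could be substituted from a task's run subsection (illustrative only, not kronos internals):

```python
# Fill the {num_cpus}/{mem}/{h_vmem} keywords from a task's run values.
qsub_options = "-pe ncpus {num_cpus} -l mem_free={mem} -l h_vmem={h_vmem}"

task_run = {'memory': '4G', 'num_cpus': 2}   # values from the run subsection
mem_gb = float(task_run['memory'].rstrip('G'))

filled = qsub_options.format(num_cpus=task_run['num_cpus'],
                             mem=task_run['memory'],
                             h_vmem='%gG' % (1.2 * mem_gb))  # h_vmem = 1.2 * memory
print(filled)
```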


    Initialize using run command

    If you only have the configuration file and not the pipeline script, you can still use the run command. To do so, simply pass the configuration file using the -y option. This instructs Kronos to initialize the pipeline first and run the resulting pipeline script subsequently. In this case, you do not have to specify the -k option.

    Tip

    You can use -i and -s along with -y to pass the input samples file and the setup file, respectively.

    Warning: If you specify both -y and -k with the run command, Kronos uses -y and ignores -k.

    Note: When using the run command, you cannot initialize only (i.e. without running the pipeline). Use the init command if you only want to make a pipeline script.

    Run the tasks locally, on a cluster or in the cloud

    When launching a pipeline, each task in the pipeline can individually be run locally or on a cluster. For this, you need to use the use_cluster attribute for each task in the configuration file.
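For example (task names illustrative), one task can be sent to the cluster while another runs locally:

```yaml
__TASK_1__:
    run:
        use_cluster: True    # submit this task's jobs to the cluster
__TASK_2__:
    run:
        use_cluster: False   # run this task on the local machine
```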

    You can also launch the pipeline in the cloud. Please refer to the Deploy Kronos to the cloud section for more information.

    2. Run the pipeline using init command and the resulting pipeline script

    You can launch a pipeline by using init command to create a pipeline script first:

    kronos -w <working_dir> init -y <config_file> -e <pipeline_name>

    and then by running the script.

    The init command has the following input options:

    Option                   Default   Description
    -h or --help             False     print help - optional
    -e or --pipeline_name    None      pipeline name - required
    -i or --input_samples    None      path to the input samples file - optional
    -s or --setup_file       None      path to the setup file - optional
    -y or --config_file      None      path to the config_file.yaml - required

    Samples file

    It is a tab-delimited file that lists the content of the SAMPLES section of the configuration file. You can use the input option -i to pass this file when using the init or run commands.

    The content of the file should look like the following:

    #sample_id ... ... ...

    where:


    • the header always starts with #sample_id; the rest of the header entries are the keys used in key:value pairs.

    • the sample IDs should be unique, e.g. DAH498, Rx23D, etc.

    • the remaining columns of each row are the corresponding values of the keys in the header.

    For instance, the following is the content of an actual samples file:

    #sample_id  bam                                 output
    DG123       /genesis/extscratch/data/DG123.bam  DG123_analysis.vcf
    DG456       /genesis/extscratch/data/DG456.bam  DG456_analysis.vcf

    If this file is passed to the -i option, the resulting configuration file would have a SAMPLES section looking like this:

    __SAMPLES__:
        DG123:
            output: 'DG123_analysis.vcf'
            bam: '/genesis/extscratch/data/DG123.bam'
        DG456:
            output: 'DG456_analysis.vcf'
            bam: '/genesis/extscratch/data/DG456.bam'

    Info

    Kronos uses the samples file to update (not to overwrite) the SAMPLES section, which means that if an ID in the samples file already exists in the SAMPLES section of the configuration file, the value of the ID is updated. Otherwise, the new sample ID entry is added to the section and the rest of the section remains unchanged.
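The update-not-overwrite behaviour can be pictured as a per-sample dictionary update (a rough model, not kronos code; the sample data is illustrative):

```python
# Existing entries keep any keys the samples file does not mention;
# new sample IDs are simply added.
samples = {'DG123': {'bam': '/old/DG123.bam', 'output': 'DG123.vcf'}}
from_file = {'DG123': {'bam': '/new/DG123.bam'},
             'DG456': {'bam': '/data/DG456.bam'}}

for sample_id, pairs in from_file.items():
    samples.setdefault(sample_id, {}).update(pairs)

print(samples['DG123'])
```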

    Setup file

    It is a tab-delimited file that lists the key:value pairs that should go in GENERAL or SHARED sections of theconfiguration file. You can use the input option -s to pass this file when using init or run commands.

    The content of the file should look like the following:

    #section key value

    where:

    • the header should always be: #section key value (tab-delimited).

    • the section column can be either __GENERAL__ or __SHARED__.

    For instance, the following is the content of an actual setup file:

    #section     key              value
    __GENERAL__  python           /genesis/extscratch/pipelines/apps/anaconda/bin/python
    __GENERAL__  java             /genesis/extscratch/pipelines/apps/jdk1.7.0_06/bin/java
    __SHARED__   reference        /genesis/extscratch/pipelines/reference/GRCh37-lite.fa
    __SHARED__   ld_library_path  ['/genesis/extscratch/pipelines/apps/anaconda/lib','/genesis/extscratch/pipelines/apps/anaconda/lib/lib']

    If this file is passed to the -s option, the resulting configuration file would have GENERAL and SHARED sections looking like this:


    __GENERAL__:
        python: '/genesis/extscratch/pipelines/apps/anaconda/bin/python'
        java: '/genesis/extscratch/pipelines/apps/jdk1.7.0_06/bin/java'

    __SHARED__:
        ld_library_path: "['/genesis/extscratch/pipelines/apps/anaconda/lib','/genesis/extscratch/pipelines/apps/anaconda/lib/lib']"
        reference: '/genesis/extscratch/pipelines/reference/GRCh37-lite.fa'

    Info

    Kronos uses the setup file to update (not to overwrite) the GENERAL and SHARED sections, which means that if a key in the setup file already exists in the target section, the value of that key is updated. Otherwise, the key:value pair is added to the target section and the rest of the pairs in the target section remain unchanged.
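The update (rather than overwrite) behaviour is essentially Python's dict.update semantics. A minimal illustration (the example paths and values here are hypothetical):

```python
# Existing GENERAL section of a configuration file (hypothetical values).
general = {'python': '/usr/bin/python', 'java': '/usr/bin/java'}

# key:value pairs read from a setup file for the __GENERAL__ section.
setup_entries = {'python': '/genesis/extscratch/pipelines/apps/anaconda/bin/python',
                 'perl': '/usr/bin/perl'}

# Existing key 'python' gets the new value, 'perl' is added,
# and 'java' is left unchanged.
general.update(setup_entries)

print(general['python'])  # /genesis/extscratch/pipelines/apps/anaconda/bin/python
print(general['java'])    # /usr/bin/java
```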

    Run the pipeline script generated by init command

    All the pipeline scripts generated by the Kronos init command can also be run as follows:

    python my_pipeline.py -c <components_dir> [options]

    where my_pipeline.py is the pipeline script you want to run.

    Warning: It is required to pass the path of the components_dir to the input option -c when running the pipeline. See What is the components directory? for more information on components_dir.

    This is the list of all the input options you can use:

    Option                       Default                                Description
    -h or --help                 False                                  print help - optional
    -b or --job_scheduler        drmaa                                  job scheduler used to manage jobs on the cluster - optional
    -c or --components_dir       None                                   path to components_dir - required
    -d or --drmaa_library_path   lib/lx24-amd64/libdrmaa.so             path to drmaa_library - optional
    -e or --pipeline_name        None                                   pipeline name - optional
    -j or --num_jobs             1                                      maximum number of simultaneous jobs per pipeline - optional
    -l or --log_file             None                                   name of the log file - optional
    -n or --num_pipelines        1                                      maximum number of simultaneous running pipelines - optional
    -p or --python_installation  python                                 path to python executable - optional
    -q or --qsub_options         None                                   native qsub specifications for the cluster in a single string - optional
    -r or --run_id               None (current timestamp will be used)  pipeline run id - optional
    -w or --working_dir          current working directory              path to the working directory - optional
    --no_prefix                  False                                  switch off the prefix that is added to all the output files by Kronos - optional


    What is the components directory?

    It is the directory where you have cloned/stored all the components. The generated pipeline has the input option -c or --components_dir that requires the path to that directory.

    Note: Note that components_dir is always the parent directory that contains the component(s). For example, if you have a component called comp1 in the path ~/my_components/comp1, you should pass ~/my_components to the -c option, e.g. python my_pipeline.py -c ~/my_components.

    Results generated by a pipeline

    When a pipeline is run, a directory is made inside the working directory with its name being the run ID. All the output files and directories are stored there, i.e. in <working_dir>/<run_id>/.

    What is the working directory?

    It is a directory used by Kronos to store all the resulting files. The user can specify the path to their desired working directory via the input option -w.

    Tip

    If the directory does not exist, then it will be made.

    Tip

    If you do not specify the working directory, the current directory is used instead.

    What is the run ID?

    Each time a pipeline is run, a unique ID is generated for that run, unless it is specified by the user using the -r option. This ID is used for the following purposes:

    • to trace back the run, i.e. logs, results, etc.

    • to enable re-running the same incomplete run, which will automatically pick up from where it left off

    • to avoid overwriting the results if the same working directory is used for all the runs

    Info

    The ID generated by Kronos (if -r not specified) is a timestamp: ‘year-month-day_hour-minute-second’.
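That timestamp corresponds to the following strftime pattern (a sketch of the format only, not Kronos's code):

```python
from datetime import datetime

# 'year-month-day_hour-minute-second', e.g. 2016-07-15_09-30-00
run_id = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
print(run_id)
```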

    What is the structure of the results directory generated by a pipeline?

    The following tree shows the general structure of the <working_dir>/<run_id>/ directory where the results are stored:


    |-- <run_id>
    |   |-- <sample_id>_<pipeline_name>
    |   |   |-- logs
    |   |   |-- outputs
    |   |   |-- scripts
    |   |   |-- sentinels
    |   |-- <sample_id>_<pipeline_name>
    |   |   |-- logs
    |   |   |-- outputs
    |   |   |-- scripts
    |   |   |-- sentinels
    |   |-- <pipeline_name>_<run_id>.yaml
    |   |-- <pipeline_name>_<run_id>.log

    where:

    • an individual subdirectory, whose name contains the sample ID, is made for each sample in the SAMPLES section.

    • there are always the following four subdirectories in each sample's directory:

    – logs: where all the log files are stored

    – outputs: where all the resulting files are stored

    – scripts: where all the scripts used to run the components are stored

    – sentinels: where all the sentinel files are stored

    If there are no samples in the SAMPLES section, then a subdirectory whose name starts with __shared__only__ is made instead of the per-sample directory. In fact, since there are no IDs in the SAMPLES section, Kronos uses the string __shared__only__ to indicate that the SAMPLES section is empty.

    Note: The developer of the pipeline can customize the content of the outputs directory (see Output directory customization for more information). So, you might see more directories inside that directory.

    Info

    The scripts directory is used by Kronos to store and manage the scripts and should not be modified.

    Info

    Sentinel files mark the successful completion of a task in the pipeline. The sentinels directory is simply used for storing these files.

    How can I relaunch a pipeline?

    If you have run a pipeline and it has stopped at some point for any reason, e.g. a breakpoint or an error, you can re-run it from where it left off. For that purpose, simply use the exact same command you used in the first place, but make sure that you also pass the run ID of the first run to the input option -r.


    Note: If you forget to pass the run ID or pass a nonexistent run ID by mistake, Kronos considers that a new run and launches the pipeline from scratch. This will not overwrite your previous results.

    Tip

    If you want to relaunch a pipeline from an arbitrary task (that already has a sentinel file), you need to go to the sentinels directory and delete the sentinel file corresponding to that task. Then relaunch the pipeline as mentioned above. Remember that all the subsequent tasks that have connections to this task will also be re-run, regardless of whether or not they have a sentinel file. The reason for this is that Kronos checks the timestamps of the sentinels, and if the sentinels of the next task are outdated compared to the current task, it will re-run them too.
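The timestamp rule described in the tip can be modelled as follows. This is our own simplified sketch of the decision, not Kronos's implementation; needs_rerun is a hypothetical helper:

```python
import os
import tempfile

def needs_rerun(task_sentinel, upstream_sentinel):
    """A task must re-run if its sentinel is missing, or if its sentinel
    is older than the sentinel of a task it depends on."""
    if not os.path.exists(task_sentinel):
        return True
    if os.path.exists(upstream_sentinel):
        return os.path.getmtime(task_sentinel) < os.path.getmtime(upstream_sentinel)
    return False

d = tempfile.mkdtemp()
up = os.path.join(d, 'TASK_1__sentinel_file')
down = os.path.join(d, 'TASK_2__sentinel_file')
for f in (up, down):
    open(f, 'w').close()

# Pretend TASK_1 was re-run after TASK_2 completed: make its sentinel newer.
os.utime(down, (1000, 1000))
os.utime(up, (2000, 2000))

print(needs_rerun(down, up))  # True: TASK_2's sentinel is outdated
print(needs_rerun(up, down))  # False: TASK_1's sentinel is up to date
```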

    Tip

    If you want to run a part of a pipeline between two tasks (two breakpoints) several times, each time you need to delete the sentinel files of the tasks between the two breakpoints as well as the sentinel file of the second breakpoint. In the new version, we're working on making this easier by eliminating the need to delete these sentinels each time.

    Tip

    A sentinel file name looks like TASK_i__sentinel_file. For the breakpoints, the sentinel file name looks like __BREAK_POINT_TASK_i__sentinel_file.

    1.6 Guides

    1.6.1 Quick tutorial

    This section can help you quickly make a component, make a pipeline, and run the pipeline step by step using the somatic variant caller Strelka (https://sites.google.com/site/strelkasomaticvariantcaller/). You need to refer to the rest of the documentation for details.

    First, you need to download the strelka_workflow.tar.gz tarball from our ftp server: ftp://ftp.bcgsc.ca/public/shahlab/kronos, and unpack it into a desired directory, say $HOME_DIR:

    $ cd $HOME_DIR
    $ tar -xvf strelka_workflow.tar.gz

    This will make a directory called strelka_workflow in $HOME_DIR.

    The test input data is a lightweight pair of cell line tumour/normal bam files that can be downloaded from here:

    • Exome normal bam file: https://xfer.genome.wustl.edu/gxfer1/project/gms/testdata/bams/hcc1395/gerald_C1TD1ACXX_7_CGATGT.bam

    • Exome tumour bam file: https://xfer.genome.wustl.edu/gxfer1/project/gms/testdata/bams/hcc1395/gerald_C1TD1ACXX_7_ATCACG.bam

    We are using this lightweight pair of bam files so that the tutorial can be done on a local computer with a decent amount of memory (>= 8G).

  • kronos Documentation, Release 2.2.0

    Make a component

    We are going to make a component called plot_strelka which will later be used in the next section. The seed of this component is an R script called plot_strelka.R which is included in the tarball in the seeds directory. It takes a Strelka result file as an input and generates a plot.

    Note: Refer to the Components for more information on the definition of seed and component.

    The seed can be run by the following command:

    $ R --no-save --args <input_file> <output_name> < plot_strelka.R

    where <input_file> and <output_name> should be replaced with a Strelka result file and a name for the output file, respectively.

    We can make a component for this seed by following steps 1 to 6:

    Step 1. Make a component template:

    $ kronos make_component plot_strelka

    This will make a component template called plot_strelka in the current working directory.

    Step 2. Copy the seed to the plot_strelka/component_seed directory.

    Step 3. Open the plot_strelka/component_main.py template file and add the following lines to the beginning of the make_cmd function:

    cmd = self.requirements['R'] + ' --no-save --args '
    cmd_args = [self.args.infile, self.args.outfile_name]
    ## redirect the seed script into R
    cmd_args.append('< ' + os.path.join(self.seed_dir, 'plot_strelka.R'))

    for v in vars(self.args).values():
        if isinstance(v, bool):
            continue
        if isinstance(v, str):
            v = repr(v)
        if isinstance(v, (list, tuple)):
            cmd_args.extend(v)
        else:
            cmd_args.extend([v])

    Therefore the final component_main.py would look like:

    """component_main.pyThis module contains Component class which extendsthe ComponentAbstract class. It is the core of a component.

    Note the places you need to change to make it work for you.They are marked with keyword 'TODO'."""

    from kronos.utils import ComponentAbstractimport os

    class Component(ComponentAbstract):

    """TODO: add component doc here."""

    def __init__(self, component_name="plot_strelka",component_parent_dir=None, seed_dir=None):

    ## TODO: pass the version of the component here.self.version = "v0.99.0"

    ## initialize ComponentAbstractsuper(Component, self).__init__(component_name,

    component_parent_dir, seed_dir)

    ## TODO: write the focus method if the component is parallelizable.## Note that it should return cmd, cmd_args.def focus(self, cmd, cmd_args, chunk):

    pass# return cmd, cmd_args

    ## TODO: this method should make the command and command arguments## used to run the component_seed via the command line. Note that## it should return cmd, cmd_args.def make_cmd(self, chunk=None):

    ## TODO: replace 'comp_req' with the actual component## requirement, e.g. 'python', 'java', etc.cmd = self.requirements['R'] + ' --no-save --args 'cmd_args = [self.args.infile, self.args.outfile_name]cmd_args.append('

  • kronos Documentation, Release 2.2.0

    cmd, cmd_args = self.focus(cmd, cmd_args, chunk)

    return cmd, cmd_args

    ## To run as stand alonedef _main():

    c = Component()c.args = component_ui.argsc.run()

    if __name__ == '__main__':import component_ui_main()

    Step 4. As mentioned earlier, the seed takes as an input a Strelka result file and a name for the output file. We have chosen the names infile and outfile_name to represent these inputs, respectively. So, open the plot_strelka/component_params.py template file and add the names to it as follows:

    """component_params.py

    Note the places you need to change to make it work for you.They are marked with keyword 'TODO'."""

    ## TODO: here goes the list of the input files. Use flags:## '__REQUIRED__' to make it required## '__FLAG__' to make it a flag or switch.input_files = {

    'infile' : '__REQUIRED__',# 'input_file2' : None

    }

    ## TODO: here goes the list of the output files.output_files = {

    'outfile_name' : '__REQUIRED__',# 'output_file1' : None

    }

    ## TODO: here goes the list of the input parameters excluding input/output files.input_params = {# 'input_param1' : '__REQUIRED__',# 'input_param2' : '__FLAG__',# 'input_param3' : None

    }

    ## TODO: here goes the return value of the component_seed.## DO NOT USE, Not implemented yet!return_value = []

    Note: You only need to change the following two lines:

    'infile' : '__REQUIRED__',
    'outfile_name' : '__REQUIRED__',

    Step 5. Open the plot_strelka/component_reqs.py template file and only change the following line:


    requirements = {
        # 'python': '__REQUIRED__',
        }

    to this:

    requirements = {
        'R': '__REQUIRED__',
        }

    The rest of the fields in this file can also be changed if desired, but this is not required.

    Step 6 (Optional). This step can be skipped. It is only needed if you want to run the component standalone, outside of a pipeline. This step creates a user interface for the component. Open the plot_strelka/component_ui.py template file and change it so it looks like:

    """component_ui.py

    Note the places you need to change to make it work for you.They are marked with keyword 'TODO'."""

    import argparse

    #==============================================================================# make a UI#==============================================================================## TODO: pass the name of the component to the 'prog' parameter and a## brief description of your component to the 'description' parameter.parser = argparse.ArgumentParser(prog='plot_strelka',

    description = """creates a plot from Strelka results.""")

    ## TODO: create the list of input options here. Add as many as desired.parser.add_argument(

    "--infile",default = None,required = True,help= """input file.""")

    parser.add_argument("--outfile_name",default = None,required = True,help= """a name for the output file.""")

    ## parse the argument parser.args, unknown = parser.parse_known_args()
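With the UI above in place, argparse accepts the two required options and passes unrecognised ones through via parse_known_args, which is what lets extra options reach a component without errors. A quick standalone check (the option values here are made up):

```python
import argparse

# Same parser as component_ui.py builds for plot_strelka.
parser = argparse.ArgumentParser(prog='plot_strelka',
                                 description="""creates a plot from Strelka results.""")
parser.add_argument("--infile", default=None, required=True, help="""input file.""")
parser.add_argument("--outfile_name", default=None, required=True,
                    help="""a name for the output file.""")

# parse_known_args tolerates options the component does not define.
args, unknown = parser.parse_known_args(
    ['--infile', 'passed.somatic.snvs.vcf',
     '--outfile_name', 'plot.pdf',
     '--some_extra_flag'])

print(args.infile)  # passed.somatic.snvs.vcf
print(unknown)      # ['--some_extra_flag']
```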

    Make a pipeline

    This section explains how to create a pipeline that runs Strelka and plots its results. For this purpose, we need to use the components included in the strelka_workflow.tar.gz tarball. There are two components called run_strelka and plot_strelka in $HOME_DIR/strelka_workflow/components, where $HOME_DIR is where you unpacked the tarball. You can also use the component we created in Make a component for plot_strelka.

    Next, export the components path to the PYTHONPATH environment variable:

    export PYTHONPATH=$HOME_DIR/strelka_workflow/components:$PYTHONPATH

    Now, we can start making a new pipeline using these components:

    Step 1. Make a new configuration file using the make_config command:

    kronos make_config run_strelka plot_strelka -o strelka_workflow

    This will create a new configuration file called strelka_workflow.yaml in the current working directory. This file has a number of sections which we go through step by step. Please refer to Configuration file to learn more about each section.

    Step 2 (optional). The first section in the configuration file is __PIPELINE_INFO__ and contains information regarding the pipeline. Here is an example for this configuration file:

    name: 'run_plot_strelka'
    version: '1.0'
    author: 'Jafar Taghiyar'
    data_type: 'SNV'
    input_type: 'bam'
    output_type: 'vcf, jpeg'
    host_cluster: 'local'
    date_created: '2016-01-04'
    date_last_updated:
    Kronos_version: '2.0.4'

    This section is only informative and does not have any effects on the pipeline.

    Step 3. The second section is __GENERAL__, which lists the requirements of all the components in the pipeline. The requirements listed in the __GENERAL__ section apply to all the components in the pipeline.

    In this pipeline, it looks like this:

    strelka: '__REQUIRED__'
    R: '__REQUIRED__'
    perl: '__REQUIRED__'

    These entries are required. However, their values can come from a setup file when running the pipeline (see How to run the pipeline). So, for now you do not need to pass values to them in the configuration file.

    Info

    Each component can also have its own requirements specified individually. However, in this quick tutorial youneed to simply leave them blank. For more information refer to here.

    Step 4. The next section is __SHARED__ where we can create variables.

    In this pipeline, we add the following variable to this section:

    __SHARED__:
        strelka_ref: # a reference genome

    Similar to the __GENERAL__ section, the value for this entry can come from the setup file when running the pipeline (see How to run the pipeline). In Step 6, we will see how we use it.

    42 Chapter 1. Table of Contents

  • kronos Documentation, Release 2.2.0

    Step 5. Next is the __SAMPLES__ section, which can be used to list the input files or parameters. By default the section looks like:

    __SAMPLES__:
    #    sample_id:
    #        param1: value1
    #        param2: value2

    In this pipeline, the input files are a pair of tumour/normal bam files. Also, we choose a parameter from the Strelka component called min_tier2_mapq to include as an input in this section, to show the functionality of the section.

    The content of this section can be provided in an input file when running the pipeline (see How to run the pipeline). So, a user does not need to pass values here.

    In Step 6, we will see how we use the content of this section.

    Step 6. The rest of the configuration file contains __TASK__ sections. These sections are where the connections among different components in the pipeline are specified. We also need to pass proper values to all the parameters in these sections that have the __REQUIRED__ keyword as input. For example, in this pipeline, we have the following entries with __REQUIRED__ as their values that we need to pass actual values to:

    • in __TASK_1__ section:

    tumour: __REQUIRED__
    ref: __REQUIRED__
    normal: __REQUIRED__
    output_dir: __REQUIRED__

    • in __TASK_2__ section:

    infile: __REQUIRED__
    outfile_name: __REQUIRED__

    Some of these entries will be filled when specifying the flow of the pipeline. For example, we would like to get the input for the first task from an input file, i.e. from the __SAMPLES__ section. For this purpose, we use IO connections:

    __TASK_1__:
        ..
        component:
            input_files:
                tumour: ('__SAMPLES__', 'tumour')
                ..
                normal: ('__SAMPLES__', 'normal')

    Next, we want the second task, __TASK_2__, to get its input from the output of the first task, __TASK_1__. Therefore, we simply pass the name of the output file from Strelka in the first task to the infile parameter of the second task:

    __TASK_2__:
        ..
        component:
            input_files:
                infile: passed.somatic.snvs.vcf

    and then we add __TASK_1__ to the forced_dependencies of __TASK_2__, which makes __TASK_2__ wait for __TASK_1__ to finish first:


    __TASK_2__:
        ..
        run:
            ..
            forced_dependencies: ['__TASK_1__']

    Note: Since the Strelka software enforces the name of its result file to be passed.somatic.snvs.vcf, we need to use the exact same name in the configuration file and then use forced_dependencies. Otherwise, an IO connection could automatically pass the output of one task to the input of another task. It also manages the dependencies automatically.
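A connection string like ('__SAMPLES__', 'tumour') is simply a reference into another section of the configuration. The lookup can be sketched like this (our own illustration; resolve_connection is a hypothetical helper, not part of Kronos's API):

```python
import ast

def resolve_connection(value, config):
    """If value is a "('__SECTION__', 'key')" string, look it up in the
    corresponding section of the config; otherwise return it unchanged."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # a plain value, e.g. a file name
    if isinstance(parsed, tuple) and len(parsed) == 2:
        section, key = parsed
        return config[section][key]
    return value

config = {'__SHARED__': {'strelka_ref': '/refs/GRCh37-lite.fa'}}
print(resolve_connection("('__SHARED__','strelka_ref')", config))
# /refs/GRCh37-lite.fa
print(resolve_connection('passed.somatic.snvs.vcf', config))
# passed.somatic.snvs.vcf
```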

    So far, we have passed the desired values to tumour, normal, and infile. For output_dir and outfile_name we only need to pick names. Let's choose strelka_output and results/passed.somatic.snvs.pdf for them, respectively:

    __TASK_1__:
        ..
        component:
            ..
            output_files:
                output_dir: strelka_output

    __TASK_2__:
        ..
        component:
            ..
            output_files:
                outfile_name: results/passed.somatic.snvs.pdf

    Note: The results/ in results/passed.somatic.snvs.pdf instructs Kronos to make a directory called results and copy the result file passed.somatic.snvs.pdf there.

    Since we want to enable a user to pass a reference genome in the setup file, i.e. without having to change the configuration file, we pass it as a variable in the __SHARED__ section (see Step 4) and use a connection to refer to it in the configuration file:

    __TASK_1__:
        ..
        component:
            input_files:
                ..
                ref: ('__SHARED__','strelka_ref')

    All the connections will be automatically replaced at runtime.

    Step 7. In Step 5, we chose the min_tier2_mapq parameter to be included in the __SAMPLES__ section. Therefore, in order to use it we need to add a connection as follows:


    __TASK_1__:
        ..
        component:
            parameters:
                ..
                min_tier2_mapq: ('__SAMPLES__','mapq2')

    Note: mapq2 in the above line is only an arbitrary key that we chose; it can be a different name. This key is used in the input file when running the pipeline later in Run a pipeline.

    The final configuration file looks like this:

    __PIPELINE_INFO__:
        name: 'run_plot_strelka'
        version: '1.0'
        author: 'Jafar Taghiyar'
        data_type: 'SNV'
        input_type: 'bam'
        output_type: 'vcf, jpeg'
        host_cluster: 'local'
        date_created: '2016-01-04'
        date_last_updated:
        Kronos_version: '2.0.4'

    __GENERAL__:
        strelka: '__REQUIRED__'
        R: '__REQUIRED__'
        perl: '__REQUIRED__'

    __SHARED__:
        strelka_ref: # a reference genome

    __SAMPLES__:
    #    sample_id:
    #        param1: value1
    #        param2: value2

    __TASK_1__:
        reserved:
            # do not change this section.
            component_name: 'run_strelka'
            component_version: '1.2.0'
            seed_version: '1.0.13'
        run:
            # NOTE: component cannot run in parallel mode.
            use_cluster: True
            memory: '10G'
            num_cpus: 1
            forced_dependencies: []
            add_breakpoint: False
            env_vars:
            boilerplate:
            requirements:
                strelka:
                perl:
        component:
            input_files:
                tumor: "('__SAMPLES__', 'tumour')"
                config:
                ref: "('__SHARED__','strelka_ref')"
                normal: "('__SAMPLES__', 'normal')"
            parameters:
                skip_depth_filters: "('__SHARED__','strelka_exome')"
                min_tier1_mapq: 20
                num_procs: 8
                min_tier2_mapq: "('__SAMPLES__','mapq2')"
            output_files:
                output_dir: 'strelka_output'

    __TASK_2__:
        reserved:
            # do not change this section.
            component_name: 'plot_strelka'
            component_version: '0.99.0'
            seed_version: '0.99.0'
        run:
            # NOTE: component cannot run in parallel mode.
            use_cluster: True
            memory: '5G'
            num_cpus: 1
            forced_dependencies: ['__TASK_1__']
            add_breakpoint: False
            env_vars:
            boilerplate:
            requirements:
                R:
        component:
            input_files:
                infile: 'passed.somatic.snvs.vcf'
            parameters:
            output_files:
                outfile_name: 'results/passed.somatic.snvs.pdf'

    Run a pipeline

    In this section, we are going to run the simple tumour/normal pair single nucleotide variant calling pipeline that we made in Make a pipeline. This pipeline consists of two tasks:

    • task 1: runs Strelka on a pair of tumour and normal bam files.

    • task 2: creates a series of plots from Strelka output.

    Requirements

    • Python >= v2.7.6

    • Strelka == v1.0.14

    • Java >= v1.7.0_06

    • Perl >= v5.8.8

    How to run the pipeline

    Step 1. Create a file cal