kronos Documentation - Read the Docs

kronos Documentation, Release 2.2.0
M. Jafar Taghiyar
July 15, 2016



  • Contents

    1 Table of Contents
        1.1 Getting started
            1.1.1 Download
            1.1.2 Dependencies
            1.1.3 Install
        1.2 Kronos package
            1.2.1 Kronos features
            1.2.2 Kronos commands
                make_component
                make_config
                update_config
                init
                run
        1.3 Configuration file
            1.3.1 Pipeline_info section
            1.3.2 General section
            1.3.3 Shared section
                Connections
                    IO connection
            1.3.4 Samples section
            1.3.5 Task section
                Reserved subsection
                Run subsection
                    use_cluster
                    memory
                    num_cpus
                    forced_dependencies
                    add_breakpoint
                    env_var
                    boilerplate
                    requirements
                    parallel_run
                    parallel_params
                    interval_file
                Component subsection
            1.3.6 More on the configuration file
                configuration file flags
                configuration file keywords
                configuration file reserved keywords
                Output directory customization
        1.4 Components
            1.4.1 Develop a component
                Component_main
                    focus method
                    make_cmd method
                    ComponentAbstract class
                Component_params
                Component_reqs
                Component_ui
                component_seed
            1.4.2 Examples
        1.5 Pipelines
            1.5.1 Create a new pipeline
                Examples
            1.5.2 Launch a pipeline
                1. Run the pipeline using run command
                    Input options of run command
                    On --qsub-options option
                    Initialize using run command
                    Run the tasks locally, on a cluster or in the cloud
                2. Run the pipeline using init command and the resulting pipeline script
                    Samples file
                    Setup file
                    Run the pipeline script generated by init command
                    What is the components directory?
                Results generated by a pipeline
                    What is the working directory?
                    What is the run ID?
                    What is the structure of the results directory generated by a pipeline?
                    How can I relaunch a pipeline?
        1.6 Guides
            1.6.1 Quick tutorial
                Make a component
                Make a pipeline
                Run a pipeline
                    Requirements
                    How to run the pipeline
                    Outputs
                More Examples
            1.6.2 Deploy Kronos to the cloud
                Setup StarCluster
                    Installing StarCluster
                    Creating an EBS volume
                    Launching your cloud cluster
                Setup Kronos
        1.7 License
        1.8 Contact
            1.8.1 Questions and feedback


Kronos is a highly flexible Python-based software tool that mainly enables bioinformatics developers, i.e. bioinformaticians who develop workflows for analyzing genomic data, to quickly make a workflow. It uses Ruffus as the underlying workflow management system and adds a level of abstraction on top of it, which significantly reduces programming overhead for workflow development and provides a mechanism to represent a workflow by a top-level YAML configuration file.

Each resulting workflow is portable, can run either locally or on a cluster, parallelizes tasks automatically, and logs all runtime events. The workflows are also highly modular and can be easily updated by editing their corresponding configuration files.

Kronos is free and open source under the MIT license and has a Docker image too. Although we have developed it with bioinformaticians in mind, Kronos can be used in any area where a workflow is required to run a series of tasks on a given set of input data.

    Note: Throughout this documentation we will use workflow and pipeline interchangeably.


  • CHAPTER 1

    Table of Contents

    1.1 Getting started

    1.1.1 Download

You can get the Kronos package from PyPI. You can also clone it from the GitHub repository.

    1.1.2 Dependencies

    You need to have the following dependencies installed:

Program/package   Version
python            2.7.*
ruffus            2.4.1
PyYaml            3.11
drmaa-python      0.7.6

where drmaa-python is optional and you will need to install it only if you want to use -b drmaa when running a workflow on a cluster.

    Info

PyYaml and ruffus are installed as dependencies when installing Kronos. So, you would only need to install drmaa-python.

    1.1.3 Install

    You can install Kronos using pip:

    pip install kronos-pipeliner

    or upgrade it by:

    pip install --upgrade kronos-pipeliner


    Tip

For a quick start, without having to go through all the details, you can refer directly to the Quick tutorial.

    1.2 Kronos package

    This section explains:

    • Kronos features

    • Kronos commands

    1.2.1 Kronos features

The Kronos package offers the following features, which eliminate the difficulties of making a pipeline:

    Info

    We define a pipeline (also called workflow) as a DAG composed of different tasks.

    • Single configuration file: the whole pipeline can be configured using a single configuration file.

    • Parallelization: parallelizable tasks are automatically run in parallel.

    • Synchronization: parallel tasks can be synchronized based on any of their parameters.

• Local, cluster and cloud support: the pipelines can be run locally, on a cluster of computing nodes, or in the cloud.

    • Forced dependencies: any task can be forced to wait for any other tasks.

• Breakpoints: a pipeline can be programmatically paused and restarted from any point in the pipeline.

• Boilerplates: an executable boilerplate or script can be injected into a task; it is run prior to running the task itself.

• Keywords: a set of specific keywords in the configuration file which will be automatically replaced by proper values at runtime.

    • Parameter sweep: a pipeline can be run for a list of different values for a set of input arguments.

• Output directory customization: the structure of the output directory where all the intermediate files and results are stored can be configured by the user in the configuration file.

    • Event logging: all the events are automatically logged.

    1.2.2 Kronos commands

Once Kronos is installed, it is added to the PATH, i.e. kronos becomes an available command which has the following sub-commands:


Command         Description
make_component  make a new component template
make_config     make a new configuration file
update_config   copy the fields of an old configuration file to a new configuration file
init            initialize a pipeline from the given configuration file
run             run Kronos-made pipelines with optional initialization

    as well as the following options:

Options              Description
-h or --help         print help - optional
-v or --version      show program's version number and exit - optional
-w or --working_dir  path/to/working_dir - optional

    Tip

The -w is optional and if not specified, the current working directory is used to save output files/directories. It is recommended to specify it to avoid overwriting existing files. See What is the working directory? for more information.

    make_component

This command creates a new component template. In other words, it automatically generates wrappers required for a seed to become a component.

    Info

    See Components for more information on seed and component.

    The command is used as follows:

kronos -w <working_dir> make_component <component_name>

For example, the following code creates a component template called my_comp in a directory called my_components_dir:

    kronos -w my_components_dir make_component my_comp

    make_config

    This command makes a new configuration file for the given list of component names.

    The command is used as follows:

kronos -w <working_dir> make_config <component_1> <component_2> ... -o <config_file_name>

For example, the following code creates a new configuration file called my_config_file.yaml for two components comp1 and comp2 in a directory called my_working_dir:

    kronos -w my_working_dir make_config comp1 comp2 -o my_config_file


Warning: It is required to export the path of the components directory to the PYTHONPATH environment variable prior to running the make_config command:

export PYTHONPATH=<path/to/components_dir>:$PYTHONPATH

    Tip

    Note that the suffix .yaml is automatically added to the end of the provided name for the configuration file.

    update_config

This command replaces the corresponding fields of an old configuration file with that of a new one. This is useful when there is a large configuration file which needs to be updated.

    The command is used as follows:

kronos -w <working_dir> update_config <old_config_file> <new_config_file> -o <updated_config_file_name>

For example, the following code creates a new configuration file called new_config_file.yaml by updating my_config_file1.yaml using my_config_file2.yaml in a directory called my_working_dir:

    kronos -w my_working_dir update_config my_config_file1.yaml my_config_file2.yaml -o new_config_file

    init

    This command initializes a new pipeline (i.e. creates a Python script) based on the input configuration file.

    Info

We also call the resulting Python script a pipeline script.

    The command is used as follows:

kronos -w <working_dir> init -y <config_file.yaml> -e <pipeline_name>

For example, the following code creates a Python script called my_pipeline.py for the input configuration file my_config_file.yaml in a directory called my_working_dir:

    kronos -w my_working_dir init -y my_config_file.yaml -e my_pipeline

The output Python script of this command can be run using the Kronos run command or directly as a Python script.

    Info

    See How to initialize a pipeline? for more information.

    Tip

    Note that the suffix .py is automatically added to the end of the provided name for the pipeline.


Warning: The init command might create the following directories in addition to the pipeline Python script:

  • intermediate_config_files

  • intermediate_pipeline_scripts

    These directories are used by Kronos and users should NOT modify them.

    run

    This command runs Kronos-made pipelines, i.e. pipeline scripts made by init command.

    The command is used as follows:

kronos run -k <pipeline_script.py> -c <components_dir> [options]

Warning: It is required to export the path of the components directory to the PYTHONPATH environment variable prior to running the run command:

export PYTHONPATH=<path/to/components_dir>:$PYTHONPATH

    Info

You can use the run command to initialize and run the pipeline using the configuration file directly (i.e. without the need to init first). See Run the pipeline using run command for more information.

    1.3 Configuration file

A configuration file generated by Kronos is a YAML file which describes a pipeline. It contains all the parameters of all the components in the pipeline as well as the information that builds the flow of the pipeline.

    A configuration file has the following major sections (shown in red ovals in the following figure):

    • __PIPELINE_INFO__

    • __GENERAL__

    • __SHARED__

    • __SAMPLES__

    • __TASK_i__

    where __TASK_i__ has the following subsections (shown in green ovals in the following figure):

    • reserved

    • run

    • component

    1.3.1 Pipeline_info section

    The __PIPELINE_INFO__ section stores information about the pipeline itself and looks like the following:


__PIPELINE_INFO__:
    name: null
    version: null
    author: null
    data_type: null
    input_type: null
    output_type: null
    host_cluster: null
    date_created: null
    date_last_updated: null
    kronos_version: '2.0.0'

    where:

    • name: a name for the pipeline

    • version: version of the pipeline

    • author: name of the developer of the pipeline

    • data_type: this can be used for database purposes

    • input_type: type of the input files to the pipeline

    • output_type: type of the output files of the pipeline

• host_cluster: the name of the cluster used to run the pipeline, or null if the pipeline is designed to run only locally

• date_created: date the pipeline was created

• date_last_updated: date the pipeline was last updated

• kronos_version: version of the kronos package that generated the configuration file; it is added automatically

    Info

All these fields are merely informative and do not have any impact on the flow of the pipeline.

    1.3.2 General section

The __GENERAL__ section contains key:value pairs derived automatically from the requirements field of the Component_reqs file of the components in the pipeline. Each key corresponds to a particular requirement, e.g. Python, java, etc., and each value is the path to where the key is. For instance, if there is a python: /usr/bin/python entry in the requirements of a component in the pipeline, then you would have the following in the __GENERAL__ section:

__GENERAL__:
    python: '/usr/bin/python'

Now, let's assume there is another Python installation on your machine in /path/my_python/bin/python and you prefer to use this instead. You can simply change the path to the desired one:

__GENERAL__:
    python: '/path/my_python/bin/python'


Warning: This will override the path of the python installation specified in the requirements of ALL the components, hence the name GENERAL. If you want to change the path for only one specific task, then you should use the requirements entry in the run subsection of that task. Note that the task's requirements entry takes precedence over the __GENERAL__ section.

    1.3.3 Shared section

In the __SHARED__ section you can define arbitrary key:value pairs and then use the keys as variables in the task sections. This helps you to parameterize the task sections. The mechanism that enables you to use variables is called a connection.

    Connections

A connection is simply a tuple, i.e. (x1, x2), where its first entry is always a section name, e.g. __SHARED__, and the second entry is a key in that section, e.g. ('__SHARED__', 'key1'), which means: 'use the value assigned to key1 in the __SHARED__ section'. For example, in the following configuration file, the value of the parameter reference of __TASK_1__ will be 'GRCh37-lite.fa' at runtime:

__SHARED__:
    ref: 'GRCh37-lite.fa'

__TASK_1__:
    component:
        input_files:
            reference: ('__SHARED__', 'ref')

    Tip

    A connection to the __SHARED__ section, i.e. its first entry is __SHARED__, is called a shared connection.

    Tip

It is recommended to use shared connections for the parameters in different tasks that expect the same value from users.
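To make the mechanism concrete, here is a minimal sketch, assuming connections are represented as Python tuples; resolve_shared is a hypothetical helper for illustration, not Kronos's actual implementation:

```python
def resolve_shared(config, task_name):
    """Replace ('__SHARED__', key) connection tuples in a task's
    input_files with the values defined in the __SHARED__ section."""
    shared = config.get('__SHARED__', {})
    inputs = dict(config[task_name]['component']['input_files'])
    for param, value in inputs.items():
        if isinstance(value, tuple) and value[0] == '__SHARED__':
            inputs[param] = shared[value[1]]  # follow the connection
    return inputs

config = {
    '__SHARED__': {'ref': 'GRCh37-lite.fa'},
    '__TASK_1__': {'component': {'input_files': {
        'reference': ('__SHARED__', 'ref'),
    }}},
}
print(resolve_shared(config, '__TASK_1__'))
# {'reference': 'GRCh37-lite.fa'}
```

Changing a value once in __SHARED__ then updates every task whose parameters connect to it.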

    IO connection

An IO connection is a connection whose first entry is a task name and whose second entry is a parameter of that task, e.g. ('__TASK_n__', 'param1') where param1 is a parameter in __TASK_n__. For instance, in the following configuration, ('__TASK_1__', 'out_file') is an IO connection which points to the out_file parameter of __TASK_1__. This connection means: 'use the value assigned to the out_file parameter of __TASK_1__ for the in_file parameter of __TASK_2__'. The value of the parameter in_file of __TASK_2__ will be 'some_file' at runtime.

__TASK_1__:
    component:
        out_file: 'some_file'

__TASK_2__:
    component:
        in_file: ('__TASK_1__', 'out_file')

    1.3.4 Samples section

The __SAMPLES__ section contains key:value pairs with a unique ID for each set of the pairs. It enables users to run the same pipeline for different sets of input arguments at once, i.e. users can perform a parameter sweep. kronos will run the pipeline for all the sets simultaneously, i.e. in parallel mode.

For example, for the following configuration file, kronos will make two intermediate pipelines and run them in parallel. In one of the intermediate pipelines the values of the tumour and normal parameters of __TASK_1__ are 'DAX1.bam' and 'DAXN1.bam', respectively, while in the other one they are 'DAX2.bam' and 'DAXN2.bam', respectively.

__SAMPLES__:
    ID1:
        tumour: 'DAX1.bam'
        normal: 'DAXN1.bam'
    ID2:
        tumour: 'DAX2.bam'
        normal: 'DAXN2.bam'

__TASK_1__:
    component:
        input_files:
            tumour: ('__SAMPLES__', 'tumour')
            normal: ('__SAMPLES__', 'normal')

    The ID of each set of input arguments, e.g. ID1 or ID2, is used by kronos to create intermediate pipelines.
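The expansion into per-sample intermediate pipelines can be sketched as follows; expand_samples is a hypothetical illustration of the behaviour, not Kronos's actual implementation:

```python
import copy

def expand_samples(config):
    """Create one intermediate config per sample ID, substituting
    ('__SAMPLES__', key) connections with that sample's values."""
    samples = config.get('__SAMPLES__', {})
    pipelines = {}
    for sample_id, values in samples.items():
        sub = copy.deepcopy(config)
        del sub['__SAMPLES__']
        for section, body in sub.items():
            inputs = body.get('component', {}).get('input_files', {})
            for param, conn in list(inputs.items()):
                if isinstance(conn, tuple) and conn[0] == '__SAMPLES__':
                    inputs[param] = values[conn[1]]  # sample connection
        pipelines[sample_id] = sub
    return pipelines

config = {
    '__SAMPLES__': {
        'ID1': {'tumour': 'DAX1.bam', 'normal': 'DAXN1.bam'},
        'ID2': {'tumour': 'DAX2.bam', 'normal': 'DAXN2.bam'},
    },
    '__TASK_1__': {'component': {'input_files': {
        'tumour': ('__SAMPLES__', 'tumour'),
        'normal': ('__SAMPLES__', 'normal'),
    }}},
}
pipes = expand_samples(config)
print(pipes['ID1']['__TASK_1__']['component']['input_files']['tumour'])
# DAX1.bam
```

Each resulting sub-config corresponds to one intermediate pipeline that kronos runs in parallel with the others.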

    Warning: Each ID in the __SAMPLES__ section must be unique, otherwise their corresponding results will beoverwritten.

Warning: kronos creates the following directories in the working directory to store the intermediate pipelines:

  • intermediate_config_files

  • intermediate_pipeline_scripts

    Users should NOT modify them.

    Tip

    A connection to the __SAMPLES__ section, i.e. its first entry is __SAMPLES__, is called a sample connection.

    The differences between __SAMPLES__ and __SHARED__ sections are:

    • a unique ID is required in the __SAMPLES__ section for each set

• a separate individual pipeline is generated for each set of key:value pairs, i.e. for each ID, in the __SAMPLES__ section

    Tip

The number of simultaneous parallel pipelines can be set by the user when running the pipeline using the input option -n.


    1.3.5 Task section

Each task section in a configuration file corresponds to a component. The name of a task section follows the convention __TASK_i__ where i is a number used to make the name unique, e.g. __TASK_1__ or __TASK_27__. If a task is run in parallel, then there will be sections with names __TASK_i_j__ which refer to the children of task __TASK_i__, e.g. __TASK_1_1__, __TASK_1_2__, etc. Each task section has the following subsections:

    • reserved

    • run

    • component
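As an illustrative aside (not part of Kronos itself), the naming convention above can be checked with a small parser; TASK_RE and parse_task_name are hypothetical helpers:

```python
import re

# Parent tasks look like __TASK_i__, children like __TASK_i_j__.
TASK_RE = re.compile(r'^__TASK_(\d+)(?:_(\d+))?__$')

def parse_task_name(name):
    """Return (parent_number, child_number or None) for a task section name."""
    m = TASK_RE.match(name)
    if not m:
        raise ValueError(f'not a task section: {name}')
    parent, child = m.groups()
    return int(parent), (int(child) if child else None)

print(parse_task_name('__TASK_27__'))   # (27, None)
print(parse_task_name('__TASK_1_2__'))  # (1, 2)
```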

    Reserved subsection

    This subsection contains information about the component of the task:

reserved:
    # do not change this section
    component_name: 'name_of_component'
    component_version: 'version_of_component'
    seed_version: 'version_of_seed'

Warning: The information in this subsection is automatically specified by kronos and should NOT be altered by users.

    Run subsection

This subsection is used to instruct kronos how to run the task. It looks like the following example:

run:
    use_cluster: False
    memory: '5G'
    num_cpus: 1
    forced_dependencies: []
    add_breakpoint: False
    env_vars:
    boilerplate:
    requirements:
    parallel_run: False
    parallel_params: []
    interval_file:

    use_cluster

You can determine whether each task in a pipeline should be run locally or on a cluster using the boolean flag use_cluster. Therefore, in a single pipeline some tasks might be run locally while the others are submitted to a cluster.

Warning: If use_cluster: True, then the pipeline should be run on a grid computing cluster. Otherwise, you will see the error message failed to load ClusterJobManager and the pipeline will eventually fail.


Warning: If use_cluster: True, make sure you pass the correct path for the drmaa library specified by the -d option (see Input options of run command for more information on input options). The default value for the -d option is $SGE_ROOT/lib/lx24-amd64/libdrmaa.so, where the SGE_ROOT environment variable is automatically added to the path, so you only need to specify the rest of the path if it is different from the default value.

    memory

If you submit a task to a cluster, i.e. use_cluster: True, then memory specifies the maximum amount of memory requested by the task.

    num_cpus

If you submit a task to a cluster, i.e. use_cluster: True, then num_cpus specifies the number of cores requested by the task.

    forced_dependencies

You can force a task to wait for some other tasks to finish by simply passing the list of their names to the attribute forced_dependencies of the task. For example, in the following config __TASK_1__ is forced to wait for __TASK_n__ and __TASK_m__ to finish running first.

__TASK_1__:
    run:
        forced_dependencies: [__TASK_n__, __TASK_m__]

    Tip

    forced_dependencies always expects a list, e.g. [], [__TASK_n__], [__TASK_n__, __TASK_m__].

    Info

A dependency B for task A means that task A must wait for task B to finish before task A starts to run.

    Info

    If there is an IO connection between two tasks, then an implicit dependency is inferred by kronos.
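Putting the two dependency sources together, here is a hypothetical sketch (task_dependencies is illustrative, not Kronos's API) of how forced and implicit IO-connection dependencies combine for a task:

```python
def task_dependencies(config, task_name):
    """Union of a task's forced dependencies and the implicit
    dependencies inferred from its IO connections."""
    task = config[task_name]
    deps = set(task.get('run', {}).get('forced_dependencies', []))
    for value in task.get('component', {}).values():
        if isinstance(value, tuple) and value[0].startswith('__TASK_'):
            deps.add(value[0])  # IO connection implies a dependency
    return deps

config = {
    '__TASK_1__': {'component': {'out_file': 'some_file'}},
    '__TASK_2__': {
        'run': {'forced_dependencies': ['__TASK_3__']},
        'component': {'in_file': ('__TASK_1__', 'out_file')},
    },
    '__TASK_3__': {'component': {}},
}
print(sorted(task_dependencies(config, '__TASK_2__')))
# ['__TASK_1__', '__TASK_3__']
```

__TASK_2__ waits for __TASK_1__ because of the IO connection and for __TASK_3__ because it was forced, even though no data flows between them.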

    add_breakpoint

A breakpoint forces a pipeline to pause. If add_breakpoint: True for a task, the pipeline will stop running after that task is done. Once the pipeline is relaunched, it will resume running from where it left off. This mechanism has a number of applications:

• if a part of a pipeline needs the user's supervision, for example to visually inspect some output data, then adding a breakpoint can pause the pipeline for the user to make sure everything is as desired and then relaunch from that point.


• you can run a part of a pipeline several times, for example to fine-tune some of the input arguments. This can be done by adding breakpoints to the start and end tasks of that part of the pipeline and relaunching the pipeline every time.

• you can run different parts of a single pipeline on different machines or clusters provided that the pipeline can access the files generated by the previous runs. For instance, you can run a pipeline locally up to some point (a breakpoint) and then relaunch the pipeline on a different machine or cluster to finish the rest of the tasks.

    Tip

If a task is parallelized and it has add_breakpoint: True, then the pipeline waits for all the children of the task to finish running and then applies the breakpoint.

    Note: When a breakpoint happens, all the running tasks are aborted.

    env_var

You can specify a list of the environment variables required for a task to run successfully directly in the configuration file. It looks like the following:

__TASK_n__:
    run:
        env_vars:
            var1: value1
            var2: value2

    Tip

    If an environment variable accepts a list of values, you can pass a list to that environment variable. For example:

env_vars:
    var1: [value1, value2, ...]
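As a sketch of what this mapping amounts to (not Kronos's actual implementation), the env_vars entries can be thought of as shell export statements; env_var_exports is a hypothetical helper, and joining list values with ':' is an assumption about how multi-value variables combine:

```python
def env_var_exports(env_vars):
    """Render an env_vars mapping as shell export statements.
    List values are joined with ':' (an assumption for illustration)."""
    lines = []
    for name, value in env_vars.items():
        if isinstance(value, list):
            value = ':'.join(str(v) for v in value)
        lines.append(f'export {name}={value}')
    return lines

print(env_var_exports({'var1': ['value1', 'value2'], 'var2': 'value3'}))
# ['export var1=value1:value2', 'export var2=value3']
```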

    boilerplate

Using this attribute you can insert a command or a script (in general, a boilerplate) directly into a task. The boilerplate is run prior to running the task. For example, assume you need to set up your Python path using the module load command. You can either pass the command as follows:

__TASK_n__:
    run:
        boilerplate: 'module load python/2.7.6'

or save it in a file, e.g. called setup_file:

module load python/2.7.6

and pass the path to the file, e.g. /path/to/setup_file, to the boilerplate attribute:

__TASK_n__:
    run:
        boilerplate: /path/to/setup_file


    requirements

Similar to the __GENERAL__ section, this entry contains a list of key:value pairs derived automatically from the requirements field of the Component_reqs file of the component. The difference is that this list contains only the requirements for this task and applies only to this task and not the rest of the tasks in the pipeline.

    It looks like the following:

__TASK_n__:
    run:
        requirements:
            req1: value1
            req2: value2

    Tip

This entry takes precedence over the __GENERAL__ section. If you want to get the values for the requirements from the __GENERAL__ section, then simply leave the value for each requirement in this entry blank or pass null. For example:

requirements:
    req1:
    req2:
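The precedence rule can be sketched as follows; resolve_requirement is a hypothetical helper illustrating the fallback behaviour described above, not Kronos's actual code:

```python
def resolve_requirement(config, task_name, req):
    """A task-level requirement value wins; a blank/null value
    falls back to the __GENERAL__ section."""
    task_reqs = config[task_name].get('run', {}).get('requirements', {})
    value = task_reqs.get(req)
    if value:  # non-empty task-level value takes precedence
        return value
    return config.get('__GENERAL__', {}).get(req)

config = {
    '__GENERAL__': {'python': '/usr/bin/python'},
    '__TASK_1__': {'run': {'requirements': {'python': None}}},
    '__TASK_2__': {'run': {'requirements': {'python': '/path/my_python/bin/python'}}},
}
print(resolve_requirement(config, '__TASK_1__', 'python'))
# /usr/bin/python
print(resolve_requirement(config, '__TASK_2__', 'python'))
# /path/my_python/bin/python
```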

    parallel_run

It is a boolean flag that specifies whether or not to run a task in parallel. If parallel_run: True, the task is automatically expanded to a number of children tasks that are run in parallel simultaneously.

    Warning: A task needs to be parallelizable to run in parallel.

    Tip

If a task is not parallelizable, the attributes parallel_run, parallel_params and interval_file will NOT be shown in the run subsection, and the following message is shown in the configuration file under the run subsection of the task: NOTE: component cannot run in parallel mode. Otherwise, it is considered parallelizable.

There are two mechanisms by which a task can be parallelized:

parallelization

In this mechanism, the task is expanded to its children, where the number of its children is determined by one of the following:

    • number of lines in the interval_file

    • number of chromosomes, if there is no interval file specified

    Tip

kronos uses the set [1, 2, ..., 22, X, Y] for chromosome names and it parallelizes a task based on this set by default if no interval file is specified.


    synchronization

    If a task has:

    • an IO connection to a second task

    • and its parallel_params attribute is also set,

    then kronos expands the first task as many times as the number of children of the second task, if the two tasks are synchronizable.

    Tip

    Two tasks are synchronizable if:

    1. both are parallelizable, and
    2. when they are both parallelized, they have the same number of children.

    Note: If any of the conditions mentioned above does not hold, then kronos automatically merges the results from the predecessor task and passes the result to the next task.

    Tip

    If task A is synchronizable with both tasks B and C individually but not simultaneously, then kronos synchronizes task A with one of them and uses the merge mechanism for the other one.

    parallel_params

    This attribute controls:

    • whether to synchronize a task with its predecessor(s)

    • over what parameters the synchronization should happen

    It accepts a list of parameters of the task that have IO connections to the predecessors. For instance, suppose task __TASK_n__ has task __TASK_m__ as its predecessor and has two IO connections with it, e.g. in_param1: ('__TASK_m__', 'out_param1') and in_param2: ('__TASK_m__', 'out_param2'). Assuming that the two tasks are synchronizable, parallel_params = ['in_param1'] forces kronos to synchronize task __TASK_n__ with task __TASK_m__ over the parameter in_param1. In other words, task __TASK_n__ is expanded as many times as the number of children of task __TASK_m__, and each of its children gets its value for in_param1 from the out_param1 of one of the children of task __TASK_m__.
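The scenario described above could be sketched in the configuration file as follows (task and parameter names are illustrative):

```yaml
__TASK_n__:
    run:
        # synchronize with __TASK_m__ over in_param1 only
        parallel_params: ['in_param1']
    component:
        input_files:
            in_param1: ('__TASK_m__', 'out_param1')
            in_param2: ('__TASK_m__', 'out_param2')
```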

    interval_file

    An interval file contains a list of intervals or chunks which a task will use as input arguments for its children. For example, if an interval file looks like:

    chunk1
    chunk2
    chunk3

    then each line, i.e. chunk1, chunk2, chunk3, will be passed separately to a child as an input argument. The path to the interval file is passed to the interval_file attribute.
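Putting this together, a hypothetical task parallelized over an interval file might be configured as:

```yaml
__TASK_1__:
    run:
        parallel_run: True
        # one child task is created per line of this file
        interval_file: /path/to/my_intervals.txt
```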


    Warning: If you want to use the interval file functionality in a task, the component of that task should support it. In other words, it should have the focus method in its component_main module. This method determines how and to which parameter a chunk should be passed.

    Component subsection

    This subsection contains all the input parameters of the component of the task. The parameters are categorized into three subsections:

    • input_files: lists all the input files and directories

    • output_files: lists all the output files and directories

    • parameters: lists all the other parameters

    1.3.6 More on the configuration file

    configuration file flags

    kronos uses the following flags assigned to various parameters of different tasks:

    • __REQUIRED__: means that the user MUST specify a value for that parameter.

    • __FLAG__: means that the parameter is a boolean flag. Users can assign True or False values to the parameter. The default value is False.

    Tip

    The default values for the parameters appear in the configuration file. If there is no default value, then either one of the configuration file flags will be used or it is left blank.

    Note: Put quotation marks around string values, for example 'GRCh37.66'. Unquoted strings, while accepted by YAML, can result in unexpected behaviour.
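For example, a parameters subsection combining these flags, a default value and a quoted string might look like this (the parameter names are illustrative):

```yaml
parameters:
    genome_version: 'GRCh37.66'   # quoted string value
    deep: '__FLAG__'              # boolean flag; defaults to False
    coverage: '__REQUIRED__'      # the user MUST fill this in
```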

    configuration file keywords

    You can use the following keywords in the configuration file, which will be automatically replaced by proper values at runtime:

    Keyword                  Description
    $pipeline_name           the name of the pipeline
    $pipeline_working_dir    the path to the working directory
    $run_id                  run ID
    $sample_id               the ID used in the samples section

    Warning: The character $ is part of the keyword and MUST be used.
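For instance (the parameter name and path layout are illustrative), the keywords can be embedded directly in a value:

```yaml
output_files:
    out: $pipeline_working_dir/results/$sample_id.vcf
```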


    configuration file reserved keywords

    The following words are reserved for the kronos package:

    • reserved

    • run

    • component

    Warning: The reserved keywords can NOT be used as the name of parameters of components/tasks.

    Output directory customization

    kronos supports paths in the output_files subsection of the component subsection. In other words, users can specify paths like /dir1/dir2/dir3/my.file for the parameters of the output_files subsection, and all the directories in the path will be automatically made if they do not exist. For example, kronos will make directories dir1, dir2 and dir3 with the given hierarchy. This mechanism enables developers to make any directory structure as desired. Basically, they can organize the outputs directory of their pipeline directly from within the configuration file. For instance, assume a pipeline has two tasks with components comp1 and comp2. The user can categorize the outputs of these tasks by the names of their corresponding components as follows (note the values assigned to the out and log parameters of each component):

    __TASK_i__:
        component:
            output_files:
                out: comp1/res/my_res_name.file
                log: comp1/log/my_log_name.log

    __TASK_ii__:
        component:
            output_files:
                out: comp2/res/my_res_name.file
                log: comp2/log/my_log_name.log

    so, the following tree is made inside the outputs directory given the above configuration file:

    outputs
    |____comp1
    |    |____log
    |    |    |____TASK_i_my_log_name.log
    |    |____res
    |         |____TASK_i_my_res_name.file
    |____comp2
         |____log
         |    |____TASK_ii_my_log_name.log
         |____res
              |____TASK_ii_my_res_name.file

    Tip

    Output filenames are always prefixed with the task names to prevent overwriting, e.g. TASK_i and TASK_ii in the above example.


    Tip

    If you want to specify a directory name for a parameter, you can do so by using the / character at the end of the directory name. This instructs kronos to make the directory in the outputs directory, or any other specified path, if the directory does not exist.
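A hypothetical output parameter naming a directory rather than a file could therefore be written as:

```yaml
output_files:
    # the trailing / tells kronos to create this directory if needed
    working_space: my_intermediate_files/
```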

    1.4 Components

    This part explains how to make new components. You need to know the following definitions first:

    seed: a seed is a computer application, a program or, in general, a command line tool that performs a specific task. This can be a simple bash copy command or a software suite like MutationSeq or Strelka.

    component: a component is a wrapper around a seed that makes the seed compatible with kronos so that the seed can then be used as part of a pipeline. In other words, components are the building blocks of the pipelines generated by kronos.

    1.4.1 Develop a component

    The purpose of components is to modularize workflows with reusable building blocks that require minimal development. The number of lines of code needed to make a new component is very small. The simple development instructions eliminate, for example, the need to use Ruffus decorators, input/output management using regular expressions and complicated dependency management in the code, which can easily become very complex as the number of tasks in a workflow grows. Furthermore, a large workflow can be divided into a set of small components, resulting in much faster and more manageable workflow development.

    All command line tools can be used as seeds and therefore wrapped as Kronos components. Regardless of how complicated they are, their corresponding components have a standard directory structure composed of specific wrappers and sub-directories. The wrappers are agnostic to the programming language used for developing the seed. The components should be developed prior to making the workflow. However, since they are individually and independently developed, and due to their reusability, the development of a component happens only once and the component can then be used in various pipelines.

    In order to develop a component you need to:

    1. create a directory with a name that is the same as the name you want for the component, e.g. my_comp.

    2. create a directory called "component_seed" inside the my_comp directory and copy the seed source code into it.

    3. create the following files inside the my_comp directory:

    • __init__.py: an empty file.

    • component_main.py: the main python script that contains the Component class.

    • component_params.py: contains all the information about input/output parameters of the component.

    • component_reqs.py: contains all the information about the requirements of the component.

    • component_ui.py: an argparse UI for the component.


    Tip

    All the above files and directories are generated by the Kronos make_component command. The user only needs to customize them.

    Tip

    The seed source code is not required to be copied into the component_seed directory. Instead, the seed can be used as a requirement for the component, which can be listed in the component_reqs.py.

    The component directory tree looks like the following:

    |-- <component_name>
    |   |-- __init__.py
    |   |-- component_main.py
    |   |-- component_params.py
    |   |-- component_reqs.py
    |   |-- component_ui.py
    |   |-- component_seed

    Note: <component_name> should be replaced with the actual name of the component, e.g. my_comp. The rest of the file and directory names should be exactly as shown above.

    Tip

    It is recommended to add the following files and directories (not generated automatically) as well:

    • component_test: a directory where all the test files exist.
    • README: a readme file to provide more information about the component.

    Component_main

    The core of a component is the component_main.py python script. This module defines the Component class which extends the ComponentAbstract class.

    Using the make_component command, the following component_main.py file is generated:

    """component_main.pyThis module contains Component class which extendsthe ComponentAbstract class. It is the core of a component.

    Note the places you need to change to make it work for you.They are marked with keyword 'TODO'."""

    from kronos.utils import ComponentAbstractimport os

    class Component(ComponentAbstract):

    20 Chapter 1. Table of Contents

  • kronos Documentation, Release 2.2.0

    """TODO: add component doc here."""

    def __init__(self, component_name="my_comp",component_parent_dir=None, seed_dir=None):

    ## TODO: pass the version of the component here.self.version = "v0.99.0"

    ## initialize ComponentAbstractsuper(Component, self).__init__(component_name,

    component_parent_dir, seed_dir)

    ## TODO: write the focus method if the component is parallelizable.## Note that it should return cmd, cmd_args.def focus(self, cmd, cmd_args, chunk):

    pass# return cmd, cmd_args

    ## TODO: this method should make the command and command arguments## used to run the component_seed via the command line. Note that## it should return cmd, cmd_args.def make_cmd(self, chunk=None):

    ## TODO: replace 'comp_req' with the actual component## requirement, e.g. 'python', 'java', etc.cmd = self.requirements['comp_req']

    cmd_args = []

    args = vars(self.args)

    ## TODO: fill the following component params to seed params dictionary## if the name of parameters of the seed are different than## component parameter names.comp_seed_map = {

    #e.g. 'component_param1': 'seedParam1',#e.g. 'component_param2': 'seedParam2',}

    for k, v in args.items():if v is None or v is False:

    continue

    ## TODO: uncomment the next line if you are using## comp_seed_map dictionary.# k = comp_seed_map[k]

    cmd_args.append('--' + k)

    if isinstance(v, bool):continue

    if isinstance(v, str):v = repr(v)

    if isinstance(v, (list, tuple)):cmd_args.extend(v)

    else:cmd_args.extend([v])

    1.4. Components 21

  • kronos Documentation, Release 2.2.0

    if chunk is not None:cmd, cmd_args = self.focus(cmd, cmd_args, chunk)

    return cmd, cmd_args

    ## To run as stand alonedef _main():

    c = Component()c.args = component_ui.argsc.run()

    if __name__ == '__main__':import component_ui_main()

    Note: The places you need to change in the generated file to make it work for you are marked with the keyword 'TODO'.

    There are two methods in this file that you need to customize:

    • focus

    • make_cmd

    focus method

    Each parallelizable component requires a focus method. The purpose of this method is to tell the component to process only one chunk of the input data rather than the entire file. How this is done varies depending on the component, but it basically adds to, or alters, the component command to this end. For example, in the following implementation, the focus method simply passes the chunk to the --interval option in the command arguments cmd_args (most of the time, this implementation does the job):

    def focus(self, cmd, cmd_args, chunk):
        cmd_args.append('--interval ' + chunk)
        return cmd, cmd_args

    Note: You need to implement focus method only if the component is parallelizable.
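As a standalone sketch of this pattern (the seed command, option name and chunk value are illustrative, not a fixed kronos API), the effect of focus on the command arguments can be seen directly:

```python
# Minimal model of the focus pattern: append the chunk as an option value.
def focus(cmd, cmd_args, chunk):
    # forward the current chunk to the seed via its --interval option
    cmd_args.append('--interval ' + chunk)
    return cmd, cmd_args

cmd, cmd_args = focus('python my_seed.py', ['--foo', 'data1'], 'chunk1')
print(cmd_args)
```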

    make_cmd method

    All components should implement this method in their component_main.py. This method essentially returns the command string that one can use to run the seed on a command line. For example, if the seed can be run using the following command:

    python my_seed_command.py --foo data1 --bar data2

    then the make_cmd method would look like this (note that we only need to change the first two lines of the default file made by kronos):

    def make_cmd(self, chunk):
        path = os.path.join(self.seed_dir, 'my_seed_command.py')
        cmd = self.requirements['python'] + ' ' + path
        cmd_args = []
        args = vars(self.args)

        ## TODO: fill the following component params to seed params dictionary
        ## if the name of parameters of the seed are different than
        ## component parameter names.
        comp_seed_map = {
            # e.g. 'component_param1': 'seedParam1',
            # e.g. 'component_param2': 'seedParam2',
            }

        for k, v in args.items():
            if v is None or v is False:
                continue

            ## TODO: uncomment the next line if you are using
            ## comp_seed_map dictionary.
            # k = comp_seed_map[k]

            cmd_args.append('--' + k)

            if isinstance(v, bool):
                continue
            if isinstance(v, str):
                v = repr(v)
            if isinstance(v, (list, tuple)):
                cmd_args.extend(v)
            else:
                cmd_args.extend([v])

        if chunk is not None:
            cmd, cmd_args = self.focus(cmd, cmd_args, chunk)

        return cmd, cmd_args

    Tip

    In the above example, python is a requirement for the component and should be added to the component_reqs.py of the component. Also, the parameters foo and bar should be added to the component_params.py.
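To see what the argument-building loop in make_cmd produces, here is a minimal standalone model (the function name and sample arguments are illustrative; the real loop lives inside the Component class):

```python
# Mimic the loop in make_cmd: skip unset values, keep flags bare,
# quote strings with repr, and flatten lists/tuples.
def build_cmd_args(args):
    cmd_args = []
    for k, v in args.items():
        if v is None or v is False:
            continue                  # drop unset options and False flags
        cmd_args.append('--' + k)
        if isinstance(v, bool):
            continue                  # True flags take no value
        if isinstance(v, str):
            v = repr(v)               # quote string values
        if isinstance(v, (list, tuple)):
            cmd_args.extend(v)
        else:
            cmd_args.extend([v])
    return cmd_args

print(build_cmd_args({'foo': 'data1', 'threads': 2, 'deep': True, 'interval': None}))
```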

    ComponentAbstract class

    This class comprises the following attributes and methods:

    Attributes:


    Attribute          Description
    args               the argparse namespace containing all the input arguments from the Component_ui module
    components_dir     path to the directory where the component exists
    component_name     name of the component - specific
    component_params   Component_params module of the component
    component_reqs     Component_reqs module of the component
    env_vars           see Component_reqs
    memory             see Component_reqs
    parallel           see Component_reqs
    requirements       see Component_reqs
    seed_dir           path to the directory where the seed exists; most of the time it is the component_seed directory inside the component directory
    seed_version       version of the seed
    version            version of the component - specific

    Tip

    specific means it should be assigned a value when implementing a component.

    Methods:

    Method     Description
    __init__   initialize general attributes that each component must have
    run        run the component command locally
    focus      update the command and command arguments for each chunk - virtual
    make_cmd   make the command used to run the seed of the component; this returns the same command that one would use to run the component as a stand-alone program via the command line - virtual
    test       run the unittest of the component - virtual

    Tip

    The class can be imported from the utils module of the kronos package:

    from kronos.utils import ComponentAbstract

    Component_params

    This is a python module and contains the following information:

    • input_files: a dictionary with keys being the input file parameters and values being the default values or proper flags based on the component UI. For example:

    input_files = {'samples': ['tumour:__REQUIRED__',
                               'normal:__REQUIRED__',
                               'reference:__REQUIRED__',
                               'model:__REQUIRED__'],
                   'config': 'some_default.cfg',
                   'positions_file': None}


    Note: This dictionary includes only parameters that expect input files or directories.

    • output_files: a dictionary with keys being the output file parameters and values being the default values or proper flags based on the component UI. For example:

    output_files = {'export_features': None,
                    'log_file': 'mutationSeq_run.log',
                    'out': None}

    Note: This dictionary includes only parameters that expect output files or directories.

    • input_params: a dictionary with keys being the input non-file parameters and values being the default values or proper flags based on the component UI. For example:

    input_params = {'all': '__FLAG__',
                    'buffer_size': '2G',
                    'coverage': 4,
                    'deep': '__FLAG__',
                    'interval': None,
                    'no_filter': '__FLAG__'}

    Note: All other parameters that are not included in input_files and output_files should be listed in input_params.

    Using the make_component command, the following component_params.py file is generated:

    """component_params.py

    Note the places you need to change to make it work for you.They are marked with keyword 'TODO'."""

    ## TODO: here goes the list of the input files. Use flags:## '__REQUIRED__' to make it required## '__FLAG__' to make it a flag or switch.input_files = {# 'input_file1' : '__REQUIRED__',# 'input_file2' : None

    }

    ## TODO: here goes the list of the output files.output_files = {# 'output_file1' : '__REQUIRED__',# 'output_file1' : None

    }

    ## TODO: here goes the list of the input parameters excluding input/output files.input_params = {# 'input_param1' : '__REQUIRED__',# 'input_param2' : '__FLAG__',# 'input_param3' : None

    }

    1.4. Components 25

  • kronos Documentation, Release 2.2.0

    ## TODO: here goes the return value of the component_seed.## DO NOT USE, Not implemented yet!return_value = []

    Example: This is an example showing the content of a component_params.py file:

    input_files = {'tumour': '__REQUIRED__',
                   'normal': '__REQUIRED__',
                   'reference': '__REQUIRED__',
                   'model': '__REQUIRED__',
                   'config': 'metadata.config',
                   'positions_file': None}

    output_files = {'export_features': None,
                    'log_file': 'mutationSeq_run.log',
                    'out': '__REQUIRED__'}

    input_params = {'all': '__FLAG__',
                    'buffer_size': '2G',
                    'coverage': 4,
                    'deep': '__FLAG__',
                    'interval': None,
                    'no_filter': '__FLAG__',
                    'normalized': '__FLAG__',
                    'normal_variant': 25,
                    'purity': 70,
                    'mapq_threshold': 20,
                    'baseq_threshold': 10,
                    'indl_threshold': 0.05,
                    'manifest': '__OPTIONAL__',
                    'single': '__FLAG__',
                    'threshold': 0.5,
                    'tumour_variant': 2,
                    'features_only': '__FLAG__',
                    'verbose': '__FLAG__',
                    'titan_mode': '__FLAG__'}

    Component_reqs

    This is a python module and contains the following information:

    • env_vars: a dictionary with keys being the names of environment variables and values being the paths/content to export. The values can be updated in the configuration file using env_var in the run subsection. Therefore, it is recommended not to include the paths as values in this file and instead use an empty list, [], or None as a value.

    • memory: specifies the minimum memory required by the component to properly run on a cluster. The format isnG, e.g. 30G.

    • parallel: a boolean flag that specifies whether or not a component can run in parallel mode.

    • requirements: a dictionary with keys usually being the name of a program/software and values being None or the flag __REQUIRED__. The values will later be updated by kronos using the content of the __GENERAL__ section.

    • seed_version: the version of the seed.


    • version: the version of the component.

    Using the make_component command, the following component_reqs.py file is generated:

    """component_reqs.py

    Note the places you need to change to make it work for you.They are marked with keyword 'TODO'."""

    ## TODO: here goes the list of the environment variables, if any,## required to export for the component to function properly.env_vars = {# 'env_var1' : ['value1', 'value2'],# 'env_var2' : 'value3'

    }

    ## TODO: here goes the max amount of the memory required.memory = '5G'

    ## TODO: set this to True if the component is parallelizable.parallel = False

    ## TODO: here goes the list of the required software/apps## called by the component.requirements = {# 'python': '__REQUIRED__',

    }

    ## TODO: here goes the version of the component seed.seed_version = '0.99.0'

    ## TODO: here goes the version of the component itself.version = '0.99.0'

    Example: This is an example showing the content of a component_reqs.py:

    env_vars = {'LD_LIBRARY_PATH': []}

    memory = '4G'

    parallel = True

    requirements = {'java': '__REQUIRED__'}

    seed_version = 'version 3.2'

    version = 'v1.0.1'

    Component_ui

    It is a python module that contains an argparse UI for the component. Using the make_component command, the following component_ui.py file is generated:

    """component_ui.py

    Note the places you need to change to make it work for you.

    1.4. Components 27

    https://docs.python.org/3/library/argparse.html

  • kronos Documentation, Release 2.2.0

    They are marked with keyword 'TODO'."""

    import argparse

    #==============================================================================
    # make a UI
    #==============================================================================
    ## TODO: pass the name of the component to the 'prog' parameter and a
    ## brief description of your component to the 'description' parameter.
    parser = argparse.ArgumentParser(prog='my_comp',
                                     description="""brief description of
                                     your component goes here.""")

    ## TODO: create the list of input options here. Add as many as desired.
    parser.add_argument(
        "-x", "--xparam",
        default=None,
        help="""help message goes here.""")

    ## parse the argument parser.
    args, unknown = parser.parse_known_args()

    Example: This is an example showing the content of a component_ui.py:

    import sys
    import argparse

    #==============================================================================
    # make a UI
    #==============================================================================
    parser = argparse.ArgumentParser(prog='snpeff',
                                     description='''Genetic variant annotation and effect
                                     prediction toolbox. It annotates and predicts the
                                     effects of variants on genes (such as amino acid changes)''',
                                     epilog='''Input file: Default is STDIN''')

    # required arguments
    required_arguments = parser.add_argument_group("Required arguments")

    required_arguments.add_argument("--out",
                                    default=None,
                                    required=True,
                                    help='''specify the path/to/out.vcf to save output to a file''')

    # mandatory / positional arguments
    required_arguments.add_argument("genome_version",
                                    choices=['GRCh37.66'],
                                    help='''genomic build version''')

    required_arguments.add_argument("variants_file",
                                    help='''file containing variants''')

    # optional options
    optional_options = parser.add_argument_group("Options")

    optional_options.add_argument("-a", "--around",
                                  default=False, action="store_true",
                                  help='''Show N codons and amino acids around change
                                  (only in coding regions). Default is 0 codons.''')

    args, unknown = parser.parse_known_args()

    Warning: It is required to use parse_known_args instead of parse_args.

    component_seed

    This is a directory within the component directory where all the source code of the actual program resides.

    1.4.2 Examples

    Please refer to our GitHub repositories for more examples. The repositories with the postfix _workflow are the pipelines and the rest are the components.

    1.5 Pipelines

    1.5.1 Create a new pipeline

    To create a new pipeline using Kronos, you simply need to follow two steps:

    1. generate a configuration file using make_config command.

    2. configure the pipeline by customizing the resulting configuration file, i.e. pass proper values to the attributes in the run subsection and set the connections.

    Note: If you need a component that does not already exist, then you need to make the component first.

    Examples

    Please refer to our GitHub repositories for more examples. The repositories with the postfix _workflow are the pipelines and the rest are the components.

    1.5.2 Launch a pipeline

    Essentially, you have two options to launch the pipelines generated by Kronos:

    1. (Recommended) use run command to initialize and run in one step.

    2. use init command to initialize the pipeline first, then run the resulting Python script.

    Note: Make sure the version of the Kronos package installed on your machine is compatible with the version used to generate the configuration file, which is shown at the top of the configuration file in the __PIPELINE_INFO__ section.


    1. Run the pipeline using run command

    It is very easy to run a pipeline using run command:

    kronos run -k <pipeline_script> -c <components_dir> [options]

    Input options of run command

    This is the list of all the input options you can use with run command:


    Option                       Default                             Description
    -h or --help                 False                               print help - optional
    -b or --job_scheduler        drmaa                               job scheduler used to manage jobs on the cluster - optional
    -c or --components_dir       None                                path to components_dir - required
    -d or --drmaa_library_path   lib/lx24-amd64/libdrmaa.so          path to drmaa_library - optional
    -e or --pipeline_name        None                                pipeline name - optional
    -i or --input_samples        None                                path to the input samples file - optional
    -j or --num_jobs             1                                   maximum number of simultaneous jobs per pipeline - optional
    -k or --kronos_pipeline      None                                path to Kronos-made pipeline script - optional
    -n or --num_pipelines        1                                   maximum number of simultaneous running pipelines - optional
    -p or --python_installation  python                              path to python executable - optional
    -q or --qsub_options         None                                native qsub specifications for the cluster in a single string - optional
    -r or --run_id               None (current timestamp is used)    pipeline run id - optional
    -s or --setup_file           None                                path to the setup file - optional
    -w or --working_dir          current working directory           path to the working directory - optional
    -y or --config_file          None                                path to the config_file.yaml - optional
    --no_prefix                  False                               switch off the prefix that is added to all the output files by Kronos - optional

    Note: "-c or --components_dir" must be specified.

    On the --qsub_options option

    There are a few keywords that can be used with the --qsub_options option. These keywords are replaced with corresponding values from the run subsection of each task when the job for that task is submitted:

    • mem: will be replaced with memory from run subsection

    • h_vmem: will be replaced with 1.2 * memory.

    • num_cpus: will be replaced with num_cpus from run subsection

    For example:

    --qsub_options " -pe ncpus {num_cpus} -l mem_free={mem} -l mem_token={mem} -l h_vmem={h_vmem} [other options]"

    Note: If you specify the --qsub_options option with hard values (i.e. not using these keywords), they will overwrite the values in the run subsection.
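A rough model of how these keywords could be substituted from a task's run subsection (illustrative only, not kronos internals):

```python
# Fill the {num_cpus}/{mem}/{h_vmem} keywords from a task's run values.
qsub_options = "-pe ncpus {num_cpus} -l mem_free={mem} -l h_vmem={h_vmem}"

task_run = {'memory': '4G', 'num_cpus': 2}   # values from the run subsection
mem_gb = float(task_run['memory'].rstrip('G'))

filled = qsub_options.format(num_cpus=task_run['num_cpus'],
                             mem=task_run['memory'],
                             h_vmem='%gG' % (1.2 * mem_gb))  # h_vmem = 1.2 * memory
print(filled)
```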


    Initialize using run command

    If you only have the configuration file and not the pipeline script, you can still use the run command. To do so, simply pass the configuration file using the -y option. This instructs Kronos to initialize the pipeline first and run the resulting pipeline script subsequently. In this case, you do not have to specify the -k option.

    Tip

    You can use -i and -s along with -y to pass the input samples file and the setup file, respectively.

    Warning: If you specify both -y and -k with the run command, Kronos uses -y and ignores -k.

    Note: When using the run command, you cannot initialize only (i.e. without running the pipeline). Use the init command if you only want to make a pipeline script.

    Run the tasks locally, on a cluster or in the cloud

    When launching a pipeline, each task in the pipeline can individually be run locally or on a cluster. For this, you need to use the use_cluster attribute for each task in the configuration file.
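For example (task names illustrative), one task can be sent to the cluster while another runs locally:

```yaml
__TASK_1__:
    run:
        use_cluster: True    # submit this task's jobs to the cluster
__TASK_2__:
    run:
        use_cluster: False   # run this task on the local machine
```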

    You can also launch the pipeline in the cloud. Please refer to the Deploy Kronos to the cloud section for more information.

    2. Run the pipeline using init command and the resulting pipeline script

    You can launch a pipeline by using init command to create a pipeline script first:

    kronos -w <working_dir> init -y <config_file> -e <pipeline_name>

    and then by running the script.

    The init command has the following input options:

    Option                   Default   Description
    -h or --help             False     print help - optional
    -e or --pipeline_name    None      pipeline name - required
    -i or --input_samples    None      path to the input samples file - optional
    -s or --setup_file       None      path to the setup file - optional
    -y or --config_file      None      path to the config_file.yaml - required

    Samples file

    It is a tab-delimited file that lists the content of the SAMPLES section of the configuration file. You can use the input option -i to pass this file when using the init or run commands.

    The content of the file should look like the following:

    #sample_id ... ... ...

    where:


    • the header always starts with #sample_id; the rest of the header entries are the keys used in key:value pairs.

    • the sample IDs should be unique, e.g. DAH498, Rx23D, etc.

    • the remaining columns of each row are the corresponding values of the keys in the header.

    For instance, the following is the content of an actual samples file:

    #sample_id  bam                                 output
    DG123       /genesis/extscratch/data/DG123.bam  DG123_analysis.vcf
    DG456       /genesis/extscratch/data/DG456.bam  DG456_analysis.vcf

    If this file is passed to the -i option, the resulting configuration file would have a SAMPLES section looking like this:

    __SAMPLES__:
        DG123:
            output: 'DG123_analysis.vcf'
            bam: '/genesis/extscratch/data/DG123.bam'
        DG456:
            output: 'DG456_analysis.vcf'
            bam: '/genesis/extscratch/data/DG456.bam'

    Info

    Kronos uses the samples file to update (not to overwrite) the SAMPLES section, which means that if an ID in the samples file already exists in the SAMPLES section of the configuration file, the value of the ID is updated. Otherwise, the new sample ID entry is added to the section and the rest of the section remains unchanged.
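The update-not-overwrite behaviour can be pictured as a per-sample dictionary update (a rough model, not kronos code; the sample data is illustrative):

```python
# Existing entries keep any keys the samples file does not mention;
# new sample IDs are simply added.
samples = {'DG123': {'bam': '/old/DG123.bam', 'output': 'DG123.vcf'}}
from_file = {'DG123': {'bam': '/new/DG123.bam'},
             'DG456': {'bam': '/data/DG456.bam'}}

for sample_id, pairs in from_file.items():
    samples.setdefault(sample_id, {}).update(pairs)

print(samples['DG123'])
```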

    Setup file

    It is a tab-delimited file that lists the key:value pairs that should go in GENERAL or SHARED sections of theconfiguration file. You can use the input option -s to pass this file when using init or run commands.

    The content of the file should look like the following:

    #section key value

    where:

    • the header should always be: #section key value (tab-delimited).

    • the section column can be either __GENERAL__ or __SHARED__.

    For instance, the following is the content of an actual setup file:

    #section     key              value
    __GENERAL__  python           /genesis/extscratch/pipelines/apps/anaconda/bin/python
    __GENERAL__  java             /genesis/extscratch/pipelines/apps/jdk1.7.0_06/bin/java
    __SHARED__   reference        /genesis/extscratch/pipelines/reference/GRCh37-lite.fa
    __SHARED__   ld_library_path  ['/genesis/extscratch/pipelines/apps/anaconda/lib','/genesis/extscratch/pipelines/apps/anaconda/lib/lib']

    If this file is passed to the -s option, the resulting configuration file would have GENERAL and SHARED sections looking like this:


    __GENERAL__:
        python: '/genesis/extscratch/pipelines/apps/anaconda/bin/python'
        java: '/genesis/extscratch/pipelines/apps/jdk1.7.0_06/bin/java'

    __SHARED__:
        ld_library_path: "['/genesis/extscratch/pipelines/apps/anaconda/lib','/genesis/extscratch/pipelines/apps/anaconda/lib/lib']"
        reference: '/genesis/extscratch/pipelines/reference/GRCh37-lite.fa'

    Info

    Kronos uses the setup file to update (not to overwrite) the GENERAL and SHARED sections, which means that if a key in the setup file already exists in the target section, the value of that key is updated. Otherwise, the key:value pair is added to the target section and the rest of the pairs in the target section remain unchanged.
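The update (rather than overwrite) behaviour is essentially Python's dict.update semantics. A minimal illustration (the example paths and values here are hypothetical):

```python
# Existing GENERAL section of a configuration file (hypothetical values).
general = {'python': '/usr/bin/python', 'java': '/usr/bin/java'}

# key:value pairs read from a setup file for the __GENERAL__ section.
setup_entries = {'python': '/genesis/extscratch/pipelines/apps/anaconda/bin/python',
                 'perl': '/usr/bin/perl'}

# Existing key 'python' gets the new value, 'perl' is added,
# and 'java' is left unchanged.
general.update(setup_entries)

print(general['python'])  # /genesis/extscratch/pipelines/apps/anaconda/bin/python
print(general['java'])    # /usr/bin/java
```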

    Run the pipeline script generated by init command

    All the pipeline scripts generated by the Kronos init command can also be run as follows:

    python my_pipeline.py -c <components_dir> [options]

    where my_pipeline.py is the pipeline script you want to run.

    Warning: It is required to pass the path of the components_dir to the input option -c when running the pipeline. See What is the components directory? for more information on components_dir.

    This is the list of all the input options you can use:

    Option                       Default                                Description
    -h or --help                 False                                  print help - optional
    -b or --job_scheduler        drmaa                                  job scheduler used to manage jobs on the cluster - optional
    -c or --components_dir       None                                   path to components_dir - required
    -d or --drmaa_library_path   lib/lx24-amd64/libdrmaa.so             path to drmaa_library - optional
    -e or --pipeline_name        None                                   pipeline name - optional
    -j or --num_jobs             1                                      maximum number of simultaneous jobs per pipeline - optional
    -l or --log_file             None                                   name of the log file - optional
    -n or --num_pipelines        1                                      maximum number of simultaneous running pipelines - optional
    -p or --python_installation  python                                 path to python executable - optional
    -q or --qsub_options         None                                   native qsub specifications for the cluster in a single string - optional
    -r or --run_id               None (current timestamp will be used)  pipeline run id - optional
    -w or --working_dir          current working directory              path to the working directory - optional
    --no_prefix                  False                                  switch off the prefix that is added to all the output files by Kronos - optional


    What is the components directory?

    It is the directory where you have cloned/stored all the components. The generated pipeline has the input option -c or --components_dir that requires the path to that directory.

    Note: Note that components_dir is always the parent directory that contains the component(s). For example, if you have a component called comp1 in the path ~/my_components/comp1, you should pass ~/my_components to the -c option, e.g. python my_pipeline.py -c ~/my_components.

    Results generated by a pipeline

    When a pipeline is run, a directory is made inside the working directory with its name being the run ID. All the output files and directories are stored there, i.e. in <working_dir>/<run_id>/.

    What is the working directory?

    It is a directory used by Kronos to store all the resulting files. The user can specify the path to their desired working directory via the input option -w.

    Tip

    If the directory does not exist, then it will be made.

    Tip

    If you do not specify the working directory, the current directory is used instead.

    What is the run ID?

    Each time a pipeline is run, a unique ID is generated for that run, unless it is specified by the user using the -r option. This ID is used for the following purposes:

    • to trace back the run, i.e. logs, results, etc.

    • to enable re-running the same incomplete run, which will automatically pick up from where it left off

    • to avoid overwriting the results if the same working directory is used for all the runs

    Info

    The ID generated by Kronos (if -r not specified) is a timestamp: ‘year-month-day_hour-minute-second’.
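That timestamp corresponds to the following strftime pattern (a sketch of the format only, not Kronos's code):

```python
from datetime import datetime

# 'year-month-day_hour-minute-second', e.g. 2016-07-15_09-30-00
run_id = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
print(run_id)
```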

    What is the structure of the results directory generated by a pipeline?

    The following tree shows the general structure of the <working_dir>/<run_id>/ directory where the results are stored:


    |-- <run_id>
    |   |-- <sample_id>_<pipeline_name>
    |   |   |-- logs
    |   |   |-- outputs
    |   |   |-- scripts
    |   |   |-- sentinels
    |   |-- <sample_id>_<pipeline_name>
    |   |   |-- logs
    |   |   |-- outputs
    |   |   |-- scripts
    |   |   |-- sentinels
    |   |-- <pipeline_name>_<run_id>.yaml
    |   |-- <pipeline_name>_<run_id>.log

    where:

    • an individual subdirectory, whose name contains the sample ID, is made for each sample in the SAMPLES section.

    • there are always the following four subdirectories in each sample's directory:

    – logs: where all the log files are stored

    – outputs: where all the resulting files are stored

    – scripts: where all the scripts used to run the components are stored

    – sentinels: where all the sentinel files are stored

    If there are no samples in the SAMPLES section, then a subdirectory whose name starts with __shared__only__ is made instead of the per-sample directory. In fact, since there are no IDs in the SAMPLES section, Kronos uses the string __shared__only__ to indicate that the SAMPLES section is empty.

    Note: The developer of the pipeline can customize the content of the outputs directory (see Output directory customization for more information). So, you might see more directories inside that directory.

    Info

    The scripts directory is used by Kronos to store and manage the scripts and should not be modified.

    Info

    Sentinel files mark the successful completion of a task in the pipeline. The sentinels directory is simply used for storing these files.

    How can I relaunch a pipeline?

    If you have run a pipeline and it has stopped at some point for any reason, e.g. a breakpoint or an error, you can re-run it from where it left off. For that purpose, simply use the exact same command you used in the first place, but make sure that you also pass the run ID of the first run to the input option -r.


    Note: If you forget to pass the run ID or pass a nonexistent run ID by mistake, Kronos considers that a new run and launches the pipeline from scratch. This will not overwrite your previous results.

    Tip

    If you want to relaunch a pipeline from an arbitrary task (that already has a sentinel file), you need to go to the sentinels directory and delete the sentinel file corresponding to that task. Then relaunch the pipeline as mentioned above. Remember that all the subsequent tasks that have connections to this task will also be re-run, regardless of whether or not they have a sentinel file. The reason for this is that Kronos checks the timestamps of the sentinels, and if the sentinels of the next task are outdated compared to the current task, it will re-run them too.
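The timestamp rule described in the tip can be modelled as follows. This is our own simplified sketch of the decision, not Kronos's implementation; needs_rerun is a hypothetical helper:

```python
import os
import tempfile

def needs_rerun(task_sentinel, upstream_sentinel):
    """A task must re-run if its sentinel is missing, or if its sentinel
    is older than the sentinel of a task it depends on."""
    if not os.path.exists(task_sentinel):
        return True
    if os.path.exists(upstream_sentinel):
        return os.path.getmtime(task_sentinel) < os.path.getmtime(upstream_sentinel)
    return False

d = tempfile.mkdtemp()
up = os.path.join(d, 'TASK_1__sentinel_file')
down = os.path.join(d, 'TASK_2__sentinel_file')
for f in (up, down):
    open(f, 'w').close()

# Pretend TASK_1 was re-run after TASK_2 completed: make its sentinel newer.
os.utime(down, (1000, 1000))
os.utime(up, (2000, 2000))

print(needs_rerun(down, up))  # True: TASK_2's sentinel is outdated
print(needs_rerun(up, down))  # False: TASK_1's sentinel is up to date
```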

    Tip

    If you want to run a part of a pipeline between two tasks (two breakpoints) several times, each time you need to delete the sentinel files of the tasks between the two breakpoints as well as the sentinel file of the second breakpoint. In the new version, we're working on making this easier by eliminating the need to delete these sentinels each time.

    Tip

    A sentinel file name looks like TASK_i__sentinel_file. For the breakpoints, the sentinel file name looks like __BREAK_POINT_TASK_i__sentinel_file.

    1.6 Guides

    1.6.1 Quick tutorial

    This section can help you quickly make a component, make a pipeline, and run the pipeline step by step using the somatic variant caller Strelka (https://sites.google.com/site/strelkasomaticvariantcaller/). You need to refer to the rest of the documentation for details.

    First, you need to download the strelka_workflow.tar.gz tarball from our ftp server: ftp://ftp.bcgsc.ca/public/shahlab/kronos, and unpack it into a desired directory, say $HOME_DIR:

    $ cd $HOME_DIR
    $ tar -xvf strelka_workflow.tar.gz

    This will make a directory called strelka_workflow in $HOME_DIR.

    The test input data is a lightweight pair of cell line tumour/normal bam files that can be downloaded from here:

    • Exome normal bam file: https://xfer.genome.wustl.edu/gxfer1/project/gms/testdata/bams/hcc1395/gerald_C1TD1ACXX_7_CGATGT.bam

    • Exome tumour bam file: https://xfer.genome.wustl.edu/gxfer1/project/gms/testdata/bams/hcc1395/gerald_C1TD1ACXX_7_ATCACG.bam

    We are using this lightweight pair of bam files so that the tutorial can be done on a local computer with a decent amount of memory (>= 8G).

  • kronos Documentation, Release 2.2.0

    Make a component

    We are going to make a component called plot_strelka which will later be used in the next section. The seed of this component is an R script called plot_strelka.R which is included in the tarball in the seeds directory. It takes a Strelka result file as an input and generates a plot.

    Note: Refer to the Components for more information on the definition of seed and component.

    The seed can be run by the following command:

    $ R --no-save --args <input_file> <output_name> < plot_strelka.R

    where <input_file> and <output_name> should be replaced with a Strelka result file and a name for the output file, respectively.

    We can make a component for this seed by following steps 1 to 6:

    Step 1. Make a component template:

    $ kronos make_component plot_strelka

    This will make a component template called plot_strelka in the current working directory.

    Step 2. Copy the seed to the plot_strelka/component_seed directory.

    Step 3. Open the plot_strelka/component_main.py template file and add the following lines to the beginning of the make_cmd function:

    cmd = self.requirements['R'] + ' --no-save --args '
    cmd_args = [self.args.infile, self.args.outfile_name]
    ## redirect the seed script into R
    cmd_args.append('< ' + os.path.join(self.seed_dir, 'plot_strelka.R'))

    for v in vars(self.args).values():
        if isinstance(v, bool):
            continue
        if isinstance(v, str):
            v = repr(v)
        if isinstance(v, (list, tuple)):
            cmd_args.extend(v)
        else:
            cmd_args.extend([v])

    Therefore the final component_main.py would look like:

    """component_main.pyThis module contains Component class which extendsthe ComponentAbstract class. It is the core of a component.

    Note the places you need to change to make it work for you.They are marked with keyword 'TODO'."""

    from kronos.utils import ComponentAbstractimport os

    class Component(ComponentAbstract):

    """TODO: add component doc here."""

    def __init__(self, component_name="plot_strelka",component_parent_dir=None, seed_dir=None):

    ## TODO: pass the version of the component here.self.version = "v0.99.0"

    ## initialize ComponentAbstractsuper(Component, self).__init__(component_name,

    component_parent_dir, seed_dir)

    ## TODO: write the focus method if the component is parallelizable.## Note that it should return cmd, cmd_args.def focus(self, cmd, cmd_args, chunk):

    pass# return cmd, cmd_args

    ## TODO: this method should make the command and command arguments## used to run the component_seed via the command line. Note that## it should return cmd, cmd_args.def make_cmd(self, chunk=None):

    ## TODO: replace 'comp_req' with the actual component## requirement, e.g. 'python', 'java', etc.cmd = self.requirements['R'] + ' --no-save --args 'cmd_args = [self.args.infile, self.args.outfile_name]cmd_args.append('

  • kronos Documentation, Release 2.2.0

    cmd, cmd_args = self.focus(cmd, cmd_args, chunk)

    return cmd, cmd_args

    ## To run as stand alonedef _main():

    c = Component()c.args = component_ui.argsc.run()

    if __name__ == '__main__':import component_ui_main()

    Step 4. As mentioned earlier, the seed takes as an input a Strelka result file and a name for the output file. We have chosen the names infile and outfile_name to represent these inputs, respectively. So, open the plot_strelka/component_params.py template file and add the names to it as follows:

    """component_params.py

    Note the places you need to change to make it work for you.They are marked with keyword 'TODO'."""

    ## TODO: here goes the list of the input files. Use flags:## '__REQUIRED__' to make it required## '__FLAG__' to make it a flag or switch.input_files = {

    'infile' : '__REQUIRED__',# 'input_file2' : None

    }

    ## TODO: here goes the list of the output files.output_files = {

    'outfile_name' : '__REQUIRED__',# 'output_file1' : None

    }

    ## TODO: here goes the list of the input parameters excluding input/output files.input_params = {# 'input_param1' : '__REQUIRED__',# 'input_param2' : '__FLAG__',# 'input_param3' : None

    }

    ## TODO: here goes the return value of the component_seed.## DO NOT USE, Not implemented yet!return_value = []

    Note: You only need to change the following two lines:

    'infile' : '__REQUIRED__',
    'outfile_name' : '__REQUIRED__',

    Step 5. Open the plot_strelka/component_reqs.py template file and only change the following line:


    requirements = {
        # 'python': '__REQUIRED__',
        }

    to this:

    requirements = {
        'R': '__REQUIRED__',
        }

    The rest of the fields in this file can also be changed if desired, but this is not required.

    Step 6 (Optional). This step can be skipped. It is only needed if you want to run the component standalone, outside of a pipeline. This step creates a user interface for the component. Open the plot_strelka/component_ui.py template file and change it so it looks like:

    """component_ui.py

    Note the places you need to change to make it work for you.They are marked with keyword 'TODO'."""

    import argparse

    #==============================================================================# make a UI#==============================================================================## TODO: pass the name of the component to the 'prog' parameter and a## brief description of your component to the 'description' parameter.parser = argparse.ArgumentParser(prog='plot_strelka',

    description = """creates a plot from Strelka results.""")

    ## TODO: create the list of input options here. Add as many as desired.parser.add_argument(

    "--infile",default = None,required = True,help= """input file.""")

    parser.add_argument("--outfile_name",default = None,required = True,help= """a name for the output file.""")

    ## parse the argument parser.args, unknown = parser.parse_known_args()
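With the UI above in place, argparse accepts the two required options and passes unrecognised ones through via parse_known_args, which is what lets extra options reach a component without errors. A quick standalone check (the option values here are made up):

```python
import argparse

# Same parser as component_ui.py builds for plot_strelka.
parser = argparse.ArgumentParser(prog='plot_strelka',
                                 description="""creates a plot from Strelka results.""")
parser.add_argument("--infile", default=None, required=True, help="""input file.""")
parser.add_argument("--outfile_name", default=None, required=True,
                    help="""a name for the output file.""")

# parse_known_args tolerates options the component does not define.
args, unknown = parser.parse_known_args(
    ['--infile', 'passed.somatic.snvs.vcf',
     '--outfile_name', 'plot.pdf',
     '--some_extra_flag'])

print(args.infile)  # passed.somatic.snvs.vcf
print(unknown)      # ['--some_extra_flag']
```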

    Make a pipeline

    This section explains how to create a pipeline that runs Strelka and plots its results. For this purpose, we need to use the components included in the strelka_workflow.tar.gz tarball. There are two components called run_strelka and plot_strelka in $HOME_DIR/strelka_workflow/components, where $HOME_DIR is where you unpacked the tarball. You can also use the component we created in Make a component for plot_strelka.

    Next, export the components path to the PYTHONPATH environment variable:

    export PYTHONPATH=$HOME_DIR/strelka_workflow/components:$PYTHONPATH

    Now, we can start making a new pipeline using these components:

    Step 1. Make a new configuration file using the make_config command:

    kronos make_config run_strelka plot_strelka -o strelka_workflow

    This will create a new configuration file called strelka_workflow.yaml in the current working directory. This file has a number of sections which we go through step by step. Please refer to Configuration file to learn more about each section.

    Step 2 (optional). The first section in the configuration file is __PIPELINE_INFO__ and contains information regarding the pipeline. Here is an example for this configuration file:

    name: 'run_plot_strelka'
    version: '1.0'
    author: 'Jafar Taghiyar'
    data_type: 'SNV'
    input_type: 'bam'
    output_type: 'vcf, jpeg'
    host_cluster: 'local'
    date_created: '2016-01-04'
    date_last_updated:
    Kronos_version: '2.0.4'

    This section is only informative and does not have any effects on the pipeline.

    Step 3. The second section is __GENERAL__, which lists the requirements of all the components in the pipeline. The requirements listed in the __GENERAL__ section apply to all the components in the pipeline.

    In this pipeline, it looks like this:

    strelka: '__REQUIRED__'
    R: '__REQUIRED__'
    perl: '__REQUIRED__'

    These entries are required. However, their values can come from a setup file when running the pipeline (see How to run the pipeline). So, for now you do not need to pass values to them in the configuration file.

    Info

    Each component can also have its own requirements specified individually. However, in this quick tutorial youneed to simply leave them blank. For more information refer to here.

    Step 4. The next section is __SHARED__ where we can create variables.

    In this pipeline, we add the following variable to this section:

    __SHARED__:
        strelka_ref: # a reference genome

    Similar to the __GENERAL__ section, the value for this entry can come from the setup file when running the pipeline (see How to run the pipeline). In Step 6, we will see how we use it.

    42 Chapter 1. Table of Contents

  • kronos Documentation, Release 2.2.0

    Step 5. Next is the __SAMPLES__ section, which can be used to list the input files or parameters. By default the section looks like:

    __SAMPLES__:
    #    sample_id:
    #        param1: value1
    #        param2: value2

    In this pipeline, the input files are a pair of tumour/normal bam files. Also, we choose a parameter from the Strelka component called min_tier2_mapq to include as an input in this section, to show the functionality of the section.

    The content of this section can be provided in an input file when running the pipeline (see How to run the pipeline). So, a user does not need to pass values here.

    In Step 6, we will see how we use the content of this section.

    Step 6. The rest of the configuration file contains __TASK__ sections. These sections are where the connections among different components in the pipeline are specified. We also need to pass proper values to all the parameters in these sections that have the __REQUIRED__ keyword as input. For example, in this pipeline, we have the following entries with __REQUIRED__ as their values that we need to pass actual values to:

    • in __TASK_1__ section:

    tumour: __REQUIRED__
    ref: __REQUIRED__
    normal: __REQUIRED__
    output_dir: __REQUIRED__

    • in __TASK_2__ section:

    infile: __REQUIRED__
    outfile_name: __REQUIRED__

    Some of these entries will be filled when specifying the flow of the pipeline. For example, we would like to get the input for the first task from an input file, i.e. from the __SAMPLES__ section. For this purpose, we use IO connections:

    __TASK_1__:
        ..
        component:
            input_files:
                tumour: ('__SAMPLES__', 'tumour')
                ..
                normal: ('__SAMPLES__', 'normal')

    Next, we want the second task, __TASK_2__, to get its input from the output of the first task, __TASK_1__. Therefore, we simply pass the name of the output file from Strelka in the first task to the infile parameter of the second task:

    __TASK_2__:
        ..
        component:
            input_files:
                infile: passed.somatic.snvs.vcf

    and then we add __TASK_1__ to the forced_dependencies of __TASK_2__, which makes __TASK_2__ wait for __TASK_1__ to finish first:


    __TASK_2__:
        ..
        run:
            ..
            forced_dependencies: ['__TASK_1__']

    Note: Since the Strelka software enforces the name of its result file to be passed.somatic.snvs.vcf, we need to use the exact same name in the configuration file and then use forced_dependencies. Otherwise, an IO connection could automatically pass the output of one task to the input of another task. It also manages the dependencies automatically.
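A connection string like ('__SAMPLES__', 'tumour') is simply a reference into another section of the configuration. The lookup can be sketched like this (our own illustration; resolve_connection is a hypothetical helper, not part of Kronos's API):

```python
import ast

def resolve_connection(value, config):
    """If value is a "('__SECTION__', 'key')" string, look it up in the
    corresponding section of the config; otherwise return it unchanged."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value  # a plain value, e.g. a file name
    if isinstance(parsed, tuple) and len(parsed) == 2:
        section, key = parsed
        return config[section][key]
    return value

config = {'__SHARED__': {'strelka_ref': '/refs/GRCh37-lite.fa'}}
print(resolve_connection("('__SHARED__','strelka_ref')", config))
# /refs/GRCh37-lite.fa
print(resolve_connection('passed.somatic.snvs.vcf', config))
# passed.somatic.snvs.vcf
```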

    So far, we have passed the desired values to tumour, normal, and infile. For output_dir and outfile_name we only need to pick names. Let's choose strelka_output and results/passed.somatic.snvs.pdf for them, respectively:

    __TASK_1__:
        ..
        component:
            ..
            output_files:
                output_dir: strelka_output

    __TASK_2__:
        ..
        component:
            ..
            output_files:
                outfile_name: results/passed.somatic.snvs.pdf

    Note: The results/ in results/passed.somatic.snvs.pdf instructs Kronos to make a directory called results and copy the result file passed.somatic.snvs.pdf there.

    Since we want to enable a user to pass a reference genome in the setup file, i.e. without having to change the configuration file, we pass it as a variable in the __SHARED__ section (see Step 4) and use a connection to refer to it in the configuration file:

    __TASK_1__:
        ..
        component:
            input_files:
                ..
                ref: ('__SHARED__','strelka_ref')

    All the connections will be automatically replaced at runtime.

    Step 7. In Step 5, we chose the min_tier2_mapq parameter to be included in the __SAMPLES__ section. Therefore, in order to use it we need to add a connection as follows:


    __TASK_1__:
        ..
        component:
            parameters:
                ..
                min_tier2_mapq: ('__SAMPLES__','mapq2')

    Note: mapq2 in the above line is only an arbitrary key that we chose; it can be a different name. This key is used in the input file when running the pipeline later in Run a pipeline.

    The final configuration file looks like this:

    __PIPELINE_INFO__:
        name: 'run_plot_strelka'
        version: '1.0'
        author: 'Jafar Taghiyar'
        data_type: 'SNV'
        input_type: 'bam'
        output_type: 'vcf, jpeg'
        host_cluster: 'local'
        date_created: '2016-01-04'
        date_last_updated:
        Kronos_version: '2.0.4'

    __GENERAL__:
        strelka: '__REQUIRED__'
        R: '__REQUIRED__'
        perl: '__REQUIRED__'

    __SHARED__:
        strelka_ref: # a reference genome

    __SAMPLES__:
    #    sample_id:
    #        param1: value1
    #        param2: value2

    __TASK_1__:
        reserved:
            # do not change this section.
            component_name: 'run_strelka'
            component_version: '1.2.0'
            seed_version: '1.0.13'
        run:
            # NOTE: component cannot run in parallel mode.
            use_cluster: True
            memory: '10G'
            num_cpus: 1
            forced_dependencies: []
            add_breakpoint: False
            env_vars:
            boilerplate:
            requirements:
                strelka:
                perl:
        component:
            input_files:
                tumor: "('__SAMPLES__', 'tumour')"
                config:
                ref: "('__SHARED__','strelka_ref')"
                normal: "('__SAMPLES__', 'normal')"
            parameters:
                skip_depth_filters: "('__SHARED__','strelka_exome')"
                min_tier1_mapq: 20
                num_procs: 8
                min_tier2_mapq: "('__SAMPLES__','mapq2')"
            output_files:
                output_dir: 'strelka_output'

    __TASK_2__:
        reserved:
            # do not change this section.
            component_name: 'plot_strelka'
            component_version: '0.99.0'
            seed_version: '0.99.0'
        run:
            # NOTE: component cannot run in parallel mode.
            use_cluster: True
            memory: '5G'
            num_cpus: 1
            forced_dependencies: ['__TASK_1__']
            add_breakpoint: False
            env_vars:
            boilerplate:
            requirements:
                R:
        component:
            input_files:
                infile: 'passed.somatic.snvs.vcf'
            parameters:
            output_files:
                outfile_name: 'results/passed.somatic.snvs.pdf'

    Run a pipeline

    In this section, we are going to run the simple tumour/normal pair single nucleotide variant calling pipeline that we made in Make a pipeline. This pipeline consists of two tasks:

    • task 1: runs Strelka on a pair of tumour and normal bam files.

    • task 2: creates a series of plots from Strelka output.

    Requirements

    • Python >= v2.7.6

    • Strelka == v1.0.14

    • Java >= v1.7.0_06

    • Perl >= v5.8.8

    How to run the pipeline

    Step 1. Create a file cal