MAPLE Submission Data Maker

Ver. 1.0

User’s Manual

JAMSTEC 11/15/2018

Contents

1. Introduction
2. Operating environment
3. Overview of operation
    3.1. Input data
    3.2. Output data
    3.3. Overview of the analysis pipeline
4. MSDM server operations
    4.1. Starting up the MSDM server
    4.2. Shutting down the MSDM server
    4.3. Changing server settings
5. Analysis pipeline operations
    5.1. Registering new jobs
    5.2. Running jobs
    5.3. Setting and changing analysis parameters
    5.4. Displaying the job list
    5.5. Refreshing the job list
    5.6. Checking analysis status
    5.7. Downloading analysis results
    5.8. Terminating an analysis
    5.9. Deleting analysis results
References

1. Introduction

This manual describes how to use MAPLE Submission Data Maker (referred to simply as “MSDM” from here on). MSDM operates as a client-server system. A virtual server image of MSDM runs on virtualization software and is controlled via a web browser, so that input data for MAPLE [1, 2] analysis can be created from Illumina sequence data. The figure below shows a schematic of the MSDM execution environment.

2. Operating environment

The following table summarizes the operating environment of MSDM.

OS                       Windows 10 or Mac OS X
CPU                      4 threads or more
Memory                   16 GB or more
Storage                  At least 256 GB of free space (SSD recommended)
Web browser              Google Chrome or Firefox
Virtualization software  VirtualBox 5.2.20 or later

3. Overview of operation

MSDM provides the analysis pipeline for creating MAPLE input data, together with a user interface for operating it.

3.1. Input data

MSDM accepts as input the read sequences of Illumina shotgun metagenomic sequencing that satisfy certain conditions. Specifically, as shown in the figure below, the library length and read length must be designed such that the total length of read 1 and read 2 is longer than the library fragment length. The analysis target is the sequence corresponding to the full length of the library fragment, which is obtained by merging read 1 and read 2 based on their overlapping sequence, so MAPLE input data can be obtained only if the reads overlap. As described in detail in Section 3.3, the input data for MSDM are either a pair of FASTQ files or a pair of gzip-compressed FASTQ files, one for read 1 and one for read 2.

MSDM Input Read Conditions
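
The read-length condition above comes down to simple arithmetic, which can be checked before sequencing data are uploaded. The following is a minimal sketch in Python, not part of MSDM itself; the function names and the example lengths are illustrative assumptions, and the 10-base minimum overlap is the value given for the merging step in Section 3.3.

```python
def expected_overlap(read1_len: int, read2_len: int, fragment_len: int) -> int:
    """Return the expected overlap in bases between read 1 and read 2.

    A positive value means the two reads together are longer than the
    library fragment, so they can in principle be merged.
    """
    return read1_len + read2_len - fragment_len


def can_merge(read1_len: int, read2_len: int, fragment_len: int,
              min_overlap: int = 10) -> bool:
    """True if the expected overlap reaches the minimum needed for merging."""
    return expected_overlap(read1_len, read2_len, fragment_len) >= min_overlap


# Hypothetical example: 2 x 300-base reads
print(can_merge(300, 300, 500))  # 500-base fragment: expected overlap of 100 bases
print(can_merge(300, 300, 650))  # 650-base fragment: the reads do not overlap
```

Because real libraries have a distribution of fragment lengths, a check like this is best applied to the longest fragments expected rather than only to the mean.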

3.2. Output data

Output data are amino acid sequences encoded by predicted open reading frames (ORFs) in library fragment sequences. In addition to complete ORFs, which contain both the start codon and the stop codon, partial ORFs, which have only one or neither of the start codon and stop codon, are included in the output data if they exceed a certain length, because they are also valid in MAPLE. The output data are amino acid sequences in FASTA format.

If there are too many amino acid sequences, analysis by MAPLE becomes extremely burdensome. For this reason, when the number of amino acid sequences exceeds a set threshold value, the output is randomly downsampled to reduce the number to that value.
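
The downsampling rule described above can be sketched as follows. This is only an illustration of random downsampling to a maximum count, not the script that MSDM actually runs; the function name, the in-memory list of sequences, and the optional seed are assumptions made for the example.

```python
import random
from typing import List, Optional


def downsample(sequences: List[str], max_count: int,
               seed: Optional[int] = None) -> List[str]:
    """Return at most max_count sequences, chosen uniformly at random.

    If the input does not exceed max_count it is returned unchanged.
    """
    if len(sequences) <= max_count:
        return list(sequences)
    rng = random.Random(seed)
    # Sample indices and sort them so the kept sequences stay in input order.
    kept = sorted(rng.sample(range(len(sequences)), max_count))
    return [sequences[i] for i in kept]
```

Fixing the seed makes a run repeatable, which can be useful when comparing results across parameter settings.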

3.3. Overview of the analysis pipeline

The figure below shows a schematic diagram of the analysis pipeline.

(1) Using read 1 FASTQ files (forward reads in FASTQ format) and read 2 FASTQ files (reverse reads in FASTQ format), an overlapping sequence of 10 bases or more from the 3′ end is found in order to generate a merged sequence for each pair of reads. PEAR [3] is used for this step. Only merged sequences are subjected to the analysis from here on.

Overview of the MSDM Pipeline

(2) The merged sequences are randomly downsampled as necessary so as not to exceed the user-definable maximum number of sequences, and the resulting sequences are used for analysis from here on.

(3) Sequences that do not satisfy a user-definable quality score are filtered out in this step. The fastq_quality_filter of FASTX-toolkit [4] is used.

(4) The remaining sequences are mapped to the PhiX control sequence of the Illumina sequencer using Bowtie 2 [5], and sequences that map to it are filtered out.

(5) Identical sequences are judged to be PCR duplicates and are filtered out to eliminate the duplication. PRINSEQ [6] is used for this step.

(6) MetaGeneAnnotator [7, 8] is used to predict the ORFs for sequences remaining after the above steps.

(7) Translated amino acid sequences are generated from the ORFs predicted in (6).

(8) A custom script (AA_divide) is used to divide the translated amino acid sequences obtained in (7) into partial ORF amino acid sequences and complete ORF amino acid sequences, the latter being those that begin with a start codon and end with a stop codon. In addition, partial ORF amino acid sequences shorter than a user-definable threshold are filtered out. The complete ORF amino acid sequences and the remaining partial ORF amino acid sequences are output.

(9) A custom script (randompick) is used to randomly downsample the amino acid sequences output in (8) as necessary so as not to exceed the user-definable maximum number of sequences, and a FASTA file of the amino acid sequences is output as the final analysis result.
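
The selection rule in step (8) can be illustrated with a short sketch. This is not the AA_divide script; the `Orf` record, its `has_start` and `has_stop` flags, and the function names are hypothetical and stand in for whatever the ORF prediction step actually reports.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Orf:
    """Hypothetical record for one predicted ORF (not the MSDM data model)."""
    seq_id: str
    amino_acids: str
    has_start: bool  # a start codon was found
    has_stop: bool   # a stop codon was found

    @property
    def is_complete(self) -> bool:
        """A complete ORF has both a start codon and a stop codon."""
        return self.has_start and self.has_stop


def select_orfs(orfs: Iterable[Orf], min_partial_len: int) -> List[Orf]:
    """Keep all complete ORFs and partial ORFs of at least min_partial_len."""
    return [o for o in orfs
            if o.is_complete or len(o.amino_acids) >= min_partial_len]


def to_fasta(orfs: Iterable[Orf]) -> str:
    """Format the selected ORFs as FASTA text."""
    return "".join(f">{o.seq_id}\n{o.amino_acids}\n" for o in orfs)
```

Complete ORFs are kept regardless of length in this sketch because the text applies the length threshold only to partial ORFs.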

4. MSDM server operations

4.1. Starting up the MSDM server

To use the MSDM pipeline, the MSDM server must already be running. Before starting up the MSDM server after installation, we recommend changing the number of CPUs and the amount of memory allocated to the server to match the operating environment, following the procedure described in Section 4.3.

Follow the procedure below to start up the MSDM server.

(1) Start up VirtualBox, and select the MSDM server from the list of virtual servers in the VirtualBox Manager window. Check that the server is powered off, and then click the “Start” button. Startup begins.

(2) After a while, startup ends, and the login screen shown below is displayed on the MSDM server console.

(3) After the server has started up, launch the web browser, and enter “localhost:30080/MSDM” as the destination URL.

(4) When the connection to the server is completed, the main screen of MSDM shown below is displayed in the browser. MSDM is used via this screen. For details on how to use MSDM, see Section 5.

The figure below shows the screen in its initial state with no jobs registered. Normally, the jobs that have been registered and executed are displayed in the job list on this screen.

While the MSDM server is running, take care to run as few other tasks as possible, because the server uses a considerable amount of hardware resources.
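
If the main screen does not appear in the browser, it can help to confirm that something is answering at the address entered in step (3). The following is a minimal sketch using only the Python standard library; the `http://` scheme, the function name, and the 5-second timeout are assumptions, since the manual gives only “localhost:30080/MSDM”.

```python
import urllib.error
import urllib.request

# Address given in step (3); the "http://" scheme is an assumption.
MSDM_URL = "http://localhost:30080/MSDM"


def server_is_up(url: str = MSDM_URL, timeout: float = 5.0) -> bool:
    """Return True if an HTTP server answers at url, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status.
        return True
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    print("MSDM server reachable:", server_is_up())
```

A result of False shortly after clicking “Start” may only mean that the virtual server is still starting up.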

4.2. Shutting down the MSDM server

To end use of the MSDM pipeline, shut down the MSDM server by following the procedure below. Shutting down and then restarting the server is also an effective remedy when, for example, an error during use has made the MSDM pipeline inoperable, or when a problem has occurred such as disk space not being freed even after unwanted jobs are deleted.

(1) Connect to the MSDM server from the web browser, and check in the Status column that there are no jobs in a Running status. Note, however, that this cannot be checked when the MSDM pipeline is inoperable or some other error has occurred.

(2) Select Machine → Close → Power Off from the VirtualBox Manager window menu to shut down the MSDM server. The MSDM server can also be shut down by clicking the power shutdown button displayed at the top right of the login screen on the server console.

(3) When the MSDM server shuts down, the server console screen closes. In the VirtualBox Manager window, make sure that the MSDM server status is “Powered Off”.

This completes the MSDM server shutdown. If necessary, also exit VirtualBox.

4.3. Changing server settings

The execution performance of the MSDM pipeline depends strongly on the number of CPUs and the amount of memory assigned to the MSDM server. In the initial state, recommended values are set for both. However, these settings can be changed to match the specifications of the computer you are using. This section describes the procedure for changing these settings.

(1) Start up VirtualBox, and select the MSDM server from the list of virtual servers in the VirtualBox Manager window. Check that the MSDM server status is “Powered Off”, and then click the “Settings” button.

(2) From the server settings screen, select the System icon.

(3) The amount of memory allocated to the server can be changed in the Base Memory field displayed in the Motherboard panel. After changing to the desired value, click the “OK” button. The new amount of memory takes effect the next time the server is started up. To change the number of CPUs, select the Processor tab.

(4) Change the number of CPUs in the Processor(s) field, and click the “OK” button. The server will use the new virtual hardware configuration the next time it is started up.

5. Analysis pipeline operations

This chapter describes the operational procedures for executing and controlling the analysis pipeline, and for checking and acquiring the analysis results. These procedures are performed via the user interface of the MSDM pipeline.

5.1. Registering new jobs

Before an analysis can be executed, a job must be registered by designating its input files, execution parameters, and so on. To register a job, click the “New Job” button.

The following dialog for registering a new job is displayed.

To register a job, enter the following information in the dialog, and click the “Upload” button.

(a) Title (required)

Designate the title of the job. The title is used as the prefix of the output file name, so enter an identifier that does not include any spaces. Preferably, the identifier should allow the user to easily identify the project name and sample name.

(b) Description (required)

Enter any job-related explanation or comment.

(c) FASTQ file (required)

In each of Forward (Read 1) and Reverse (Read 2), enter the FASTQ file of the paired-end reads to be used as input. Clicking the “Choose File” button displays the corresponding file selection dialog. Either FASTQ files or gzip-compressed FASTQ files can be selected. However, the same file format must be selected for both Forward and Reverse.

Smaller FASTQ files take less time to upload to the server. For this reason, try to specify files in gzip-compressed format. (FASTQ files output from the Illumina sequencer are also in gzip-compressed format and normally have a name that ends with “fastq.gz”.)
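
If only uncompressed FASTQ files are at hand, they can be compressed before upload. The following is a minimal sketch using the Python standard library; the function name and the choice to write the compressed copy next to the original are assumptions made for illustration.

```python
import gzip
import shutil
from pathlib import Path


def gzip_fastq(path: str) -> Path:
    """Compress a FASTQ file to '<name>.gz' next to it and return the new path.

    The original file is left in place.
    """
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    # Stream the data in chunks so large files are not loaded into memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst
```

Both the Forward and Reverse files should be compressed in the same way, since the same format must be selected for both.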

When the “Show” button to the right of the Parameters item is clicked, the parameter setting form is displayed. The analysis parameters of the job being registered can be changed by editing the parameter values in this form.

If necessary, change the parameters by editing the values in the Value column. While editing, a parameter can be reverted to its default by clicking the default value shown in the Default column of the corresponding row. For an explanation of each parameter, see Section 5.3.

After all required settings are completed, click the “Upload” button to start the file upload. Note that files exceeding 10 GB in size cannot be uploaded.

Also note that uploading a designated file takes an excessive amount of time if the file is larger than the amount of memory on the server. The amount of memory on the server can be checked in the display at the top right of the job list screen.

When the file upload and job registration are completed and the job can be run, the new job appears in the job list. At this time, “Only uploaded” is displayed in the Status field for the job in the job list.
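
The registration rules in this section (a title without spaces, the same file format for both reads, and the 10 GB upload limit) can be checked locally before opening the dialog. The following is a hypothetical sketch, not a feature of MSDM; the function names are invented, the format is guessed from the “.gz” suffix, and interpreting 10 GB as 10 × 1024³ bytes is an assumption.

```python
import os
from typing import List

# 10 GB upload limit; reading "GB" as binary gigabytes is an assumption.
MAX_UPLOAD_BYTES = 10 * 1024 ** 3


def is_gzip_name(path: str) -> bool:
    """Guess from the file name whether a file is gzip-compressed."""
    return path.lower().endswith(".gz")


def check_job(title: str, forward: str, reverse: str) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not title or any(ch.isspace() for ch in title):
        problems.append("Title must be non-empty and contain no spaces.")
    if is_gzip_name(forward) != is_gzip_name(reverse):
        problems.append("Forward and Reverse files must use the same format.")
    for path in (forward, reverse):
        # The size is checked only for files that exist locally.
        if os.path.exists(path) and os.path.getsize(path) > MAX_UPLOAD_BYTES:
            problems.append(f"{path} exceeds the 10 GB upload limit.")
    return problems
```

For example, `check_job("my sample", "a_R1.fastq.gz", "a_R2.fastq")` reports two problems: the space in the title and the mismatched file formats.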

5.2. Running jobs

When registration is completed, the “Run” button appears in the Command column of the corresponding job in the job list.

Click the “Run” command button. The Run(Job) confirmation dialog shown below is displayed.

To check or change the analysis parameters again before running the job, select the “Show” button to the right of the Parameters item.

The current settings and default values of each parameter are displayed. Parameters can be changed by editing the values in the Value column. For an explanation of each parameter, see Section 5.3.

To run a job, click the “Run” button. The Status field in the job list then changes to “Running...”.

5.3. Setting and changing analysis parameters

The following table summarizes the analysis parameters that can be set and explains each of them.

Command               Parameter  Description
pear                  -j         Number of threads used by the PEAR command. Do not set a value larger than the number of CPUs on the server.
pear                  lenMin     Minimum sequence length after merging. Sequences shorter than this are filtered out.
pear                  #seqMax    Maximum number of sequences after merging. Sequences are randomly downsampled when the number of sequences exceeds this value.
fastq_quality_filter  -p         Minimum percentage of bases that must have a quality score of at least the value designated by -q. Sequences that do not reach this percentage are filtered out.
fastq_quality_filter  -q         Minimum quality score
AA_divide             aalen      Minimum length of partial ORF amino acid sequences. Partial ORF amino acid sequences shorter than this are filtered out.
randompick            #seq       Maximum number of amino acid sequences that are output. Amino acid sequences are randomly downsampled when the number of sequences exceeds this value.

These parameters can be set when registering a job, running a job, or re-running a job.
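
How the -q and -p parameters of fastq_quality_filter work together can be illustrated with a short sketch. This is a reimplementation of the rule for explanation only, not the FASTX-toolkit code; the function names are invented, and the Phred+33 quality encoding is an assumption about the input files.

```python
from typing import List


def phred_scores(quality_line: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_line]


def passes_quality(quality_line: str, min_score: int, min_percent: float,
                   offset: int = 33) -> bool:
    """True if at least min_percent of the bases score min_score or higher.

    min_score plays the role of -q and min_percent the role of -p.
    """
    scores = phred_scores(quality_line, offset)
    if not scores:
        return False
    good = sum(1 for s in scores if s >= min_score)
    return 100.0 * good / len(scores) >= min_percent


# With Phred+33, "I" encodes a score of 40 and "#" a score of 2.
print(passes_quality("IIIIIII#", min_score=20, min_percent=80))  # 7 of 8 bases pass
print(passes_quality("IIII####", min_score=20, min_percent=80))  # 4 of 8 bases pass
```

Raising either value makes the filter stricter: a higher -q demands better bases, and a higher -p demands that more of the read meets that score.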

5.4. Displaying the job list

The job list is displayed in the main screen of this system. Jobs are listed in reverse chronological order, starting with the most recently completed upload.

The top right of the list displays the amount of storage used and the maximum storage of the MSDM server, together with the amount of memory used and allocated, as of the last time the screen was refreshed. Take appropriate measures, such as limiting the number of running jobs and deleting analysis data, to prevent an excessive load from being placed on the server.

The following table summarizes the items displayed in the job list.

No.  Item name                     Description
1    ID                            Automatically assigned job ID
2    Title                         Title entered by the user at job registration
3    Description                   Description entered by the user at job registration
4    Input reads                   File name of each read data file entered by the user at job registration
5    Period                        The first row displays the job start time, and the second row displays the job completion time.
6    Run time [sec]                Time taken to run the job. If a job is running, the time elapsed since the start of the job is displayed.
7    Status                        Displays the job status, which is one of the following.
                                   • Only uploaded: File upload completed
                                   • Running...: Job currently running
                                   • Terminated: Either the job was stopped midway or a server shutdown stopped the job
                                   • Complete: Job ended normally
                                   • Error: Job was run but ended in error
8    Parameters                    Button for displaying the analysis parameters
9    Statistics                    Sequentially displays the statistical information of each step in the analysis pipeline as the analysis progresses. The number of reads/number of sequences and base length of nucleic acid sequences, the number of amino acid sequences and sequence lengths, and other information are displayed.
10   Result                        Displays the button for downloading the amino acid sequences in the analysis result after analysis ends normally.
11   Sequence length distribution  Merged reads: Button for displaying a histogram of the sequence lengths after merging of read 1 and read 2
                                   Result: Button for displaying a histogram of the sequence lengths of the amino acid sequences in the final analysis result
12   Command                       Displays the command execution buttons for jobs.
                                   Run: Runs a job
                                   Rerun: Runs a job again (the analysis result is overwritten)
                                   Terminate: Terminates a running job
                                   Delete: Deletes a job (including accompanying data)

5.5. Refreshing the job list

To refresh the display of the job list to its latest state, click either the “Reload” button at the top of the job list or the system title at the top left of the screen.

When there are still jobs in the job list whose Status is indicated as “Running...”, the job list can also be refreshed by clicking the “Running...” indication.

5.6. Checking analysis status

The analysis status (analysis parameters, various statistical information, etc.) of jobs that are currently being analyzed or that have already been analyzed can be checked at any time.

(a) Checking analysis parameters

To check analysis parameters, click the “Show” button in the Parameters field of the desired job in the job list. A popup window containing the analysis parameters is displayed. The popup window can be moved by dragging it. To close the popup window, click the “x” button at the top left.

(b) Checking statistical information at each analysis step

To check the statistical information (number of reads/number of sequences, read length/sequence length, etc.) at each analysis step, click the “Detail” button in the Statistics column of the desired job. For a job that is currently being analyzed, the statistical information of the steps completed so far can be checked.

(c) Checking the distribution of sequence lengths of merged reads

When merging of read 1 and read 2 by the PEAR command is completed, a histogram of the lengths of the merged sequences can be displayed by clicking the “Merged reads” button in the Sequence length distribution field. The histogram can be moved by dragging it, and can be closed by clicking inside the histogram.

Completion of the analysis by the PEAR command can be checked by clicking the “Detail” button in the Statistics field. If completed, the statistical information of pear (limited) is displayed.

The X-axis of the histogram shows the length of the merged sequences, and the Y-axis shows the number of sequences. The blue line shows the histogram of all sequences that were merged by the PEAR command, and the red line shows the histogram of the sequence lengths after filtering and downsampling according to the threshold values of sequence length and number of sequences. We recommend using this histogram to check whether the distribution of sequence lengths is consistent with the expected size distribution of the library. If an abnormality is found at this stage, it may be necessary to consider reviewing the analysis parameters or redoing the sequencing, for example. For details on terminating or re-analyzing currently running jobs, see Section 5.8.

(d) Checking the distribution of amino acid sequence lengths in the analysis result

When analysis ends normally, the “Result” button is displayed in the Sequence length distribution field. A histogram of the amino acid sequence lengths in the final analysis result can be checked by clicking this button. The X-axis of the histogram shows the length of the amino acid sequences, and the Y-axis shows the number of sequences. A greater number of long sequences indicates higher reliability of MAPLE analysis.
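
The kind of length distribution described in (c) can also be computed outside MSDM from any FASTQ file of merged reads, for example to compare it against the expected library size. The following is a minimal sketch; it assumes plain four-line FASTQ records, and the function names and the 10-base bin width are illustrative choices, not values used by MSDM.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def fastq_lengths(lines: Iterable[str]) -> Iterator[int]:
    """Yield the sequence length of each record in four-line FASTQ text."""
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of each record holds the bases
            yield len(line.strip())


def length_histogram(lengths: Iterable[int], bin_width: int = 10) -> Dict[int, int]:
    """Count sequences per length bin; keys are the lower edge of each bin."""
    counts = Counter((n // bin_width) * bin_width for n in lengths)
    return dict(sorted(counts.items()))
```

Passing an open file object to `fastq_lengths` processes the file line by line, so the whole file does not need to fit in memory.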

5.7. Downloading analysis results

When a job ends normally, the “FASTA” button appears in the Result column.

Clicking the “FASTA” button downloads the FASTA file of the amino acid sequences. This file is the MAPLE input file. Before running an analysis with MAPLE, we recommend displaying and checking the statistical information by the methods described in Section 5.6 to confirm that the analysis result is appropriate.

5.8. Terminating an analysis

To terminate a running job, click the “Terminate” button in the Command field of the currently running job.

The job termination confirmation dialog is displayed. Select “OK”.

When a job is terminated, the Status field in the job list changes to “Terminated”. In this state, the job can be run again from the beginning by clicking the “Rerun” button in the Command field.

5.9. Deleting analysis results

To free up space on the server disk, analysis results and input data must be deleted. To do this, click the “Delete” button in the Command field of the job to be deleted in the job list.

The data deletion confirmation dialog is displayed. Select “OK”.

Once data are deleted, the corresponding job disappears from the job list.

References

[1] Arai, W., Taniguchi, T., Goto, S., Moriya, Y., Uehara, H., Takemoto, K., et al. (2018). MAPLE 2.3.0: an improved system for evaluating the functionomes of genomes and metagenomes. Bioscience, Biotechnology, and Biochemistry, 82(9), 1515–1517.

[2] Takami, H., Taniguchi, T., Arai, W., Takemoto, K., Moriya, Y., and Goto, S. (2016). An automated system for evaluation of the potential functionome: MAPLE version 2.1.0. DNA Research, 23(5), 467–475.

[3] Zhang, J., Kobert, K., Flouri, T., and Stamatakis, A. (2014). PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics, 30(5), 614–620.

[4] http://hannonlab.cshl.edu/fastx_toolkit/

[5] Langmead, B., and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359.

[6] Schmieder, R., and Edwards, R. (2011). Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27(6), 863–864.

[7] Noguchi, H., Taniguchi, T., and Itoh, T. (2008). MetaGeneAnnotator: Detecting Species-Specific Patterns of Ribosomal Binding Site for Precise Gene Prediction in Anonymous Prokaryotic and Phage Genomes. DNA Research, 15(6), 387–396.

[8] Noguchi, H., Park, J., and Takagi, T. (2006). MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Research, 34(19), 5623–5630.