MAPLE Submission Data Maker Ver. 1.0 User’s Manual
JAMSTEC
Contents
1. Introduction
2. Operating environment
3. Overview of operation
  3.1. Input data
  3.2. Output data
  3.3. Overview of the analysis pipeline
4. MSDM server operations
  4.1. Starting up the MSDM server
  4.2. Shutting down the MSDM server
5. Analysis pipeline operations
  5.1. Registering new jobs
  5.2. Running jobs
  5.3. Setting and changing analysis parameters
  5.4. Displaying the job list
  5.5. Refreshing the job list
  5.6. Checking analysis status
  5.7. Downloading analysis results
  5.8. Terminating an analysis
  5.9. Deleting analysis results
References
1. Introduction
This manual describes how to use MAPLE Submission Data Maker (referred to simply as “MSDM”
from here on). MSDM operates as a client-server system. A virtual server image of MSDM runs on
virtualization software as a server that is controlled via a web browser, so that input data for MAPLE
[1, 2] analysis can be created from Illumina sequence data. The figure below shows a schematic of the
MSDM execution environment.
2. Operating environment
The following table summarizes the operating environment of MSDM.
Item                     Requirement
OS                       Windows 10 or Mac OS X
CPU                      4 threads or more
Memory                   16 GB or more
Storage                  At least 256 GB of free space (SSD recommended)
Web browser              Google Chrome or Firefox
Virtualization software  VirtualBox 5.2.20 or later
3. Overview of operation
MSDM provides an analysis pipeline for creating MAPLE input data, together with a user interface for operating it.
3.1. Input data
MSDM accepts as input the read sequences of Illumina shotgun metagenomic sequencing that satisfy
certain conditions. Specifically, as shown in the figure below, the library length and read
length must be designed such that the total length of read 1 and read 2 is longer than the library fragment length.
The analysis target is the sequence corresponding to the overall length of the library fragment, which is
obtained by merging read 1 and read 2 based on the overlapping sequence, so MAPLE input data can be
obtained only if there is overlap. As described in detail in Section 3.3, MSDM input data are either a pair of
FASTQ files or a pair of gzip-compressed FASTQ files, one for each of read 1 and read 2.
MSDM Input Read Conditions
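The read-length condition can be checked with simple arithmetic before sequencing or uploading. The following is a minimal sketch and not part of MSDM: the function names and example lengths are hypothetical, and the default minimum overlap of 10 bases is taken from step (1) of Section 3.3.

```python
def expected_overlap(read1_len: int, read2_len: int, fragment_len: int) -> int:
    """Return the expected overlap (in bases) between read 1 and read 2.

    A positive value means the two reads together are longer than the
    library fragment, which is the condition required for merging.
    """
    return read1_len + read2_len - fragment_len


def can_merge(read1_len: int, read2_len: int, fragment_len: int,
              min_overlap: int = 10) -> bool:
    """True if the expected overlap reaches the minimum needed for merging."""
    return expected_overlap(read1_len, read2_len, fragment_len) >= min_overlap


# Hypothetical example: 2 x 300-base reads on a 500-base library fragment.
print(expected_overlap(300, 300, 500))  # 100
print(can_merge(300, 300, 500))         # True
print(can_merge(150, 150, 500))         # False: the reads do not overlap
```

Because real libraries have a distribution of fragment lengths, a check like this is best applied to the longest fragments expected, not only to the mean.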
3.2. Output data
Output data are amino acid sequences encoded by predicted open reading frames (ORFs) in library
fragment sequences. In addition to complete ORFs, which contain both the start codon and the stop codon,
partial ORFs, which lack the start codon, the stop codon, or both, are included in the output data if they
reach a certain length, because they are valid in MAPLE. The output data are amino acid
sequences in FASTA format.
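The distinction between complete and partial ORFs can be sketched as follows. This is an illustration only: the manual does not describe how MSDM represents ORFs internally, so the `OrfProtein` record, its field names, and the `divide_orfs` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OrfProtein:
    """Hypothetical record for one translated ORF."""
    seq_id: str
    sequence: str    # amino acid sequence
    has_start: bool  # the ORF contains a start codon
    has_stop: bool   # the ORF contains a stop codon


def divide_orfs(orfs: List[OrfProtein], min_partial_len: int
                ) -> Tuple[List[OrfProtein], List[OrfProtein]]:
    """Split ORFs into (complete, partial), dropping short partial ORFs."""
    complete: List[OrfProtein] = []
    partial: List[OrfProtein] = []
    for orf in orfs:
        if orf.has_start and orf.has_stop:
            complete.append(orf)          # complete ORFs are always kept
        elif len(orf.sequence) >= min_partial_len:
            partial.append(orf)           # partial ORFs must be long enough
    return complete, partial
```

Complete ORFs are kept regardless of length in this sketch; only partial ORFs are subject to the length threshold.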
If there are too many amino acid sequences, analysis by MAPLE becomes extremely burdensome. For
this reason, when the number of amino acid sequences exceeds a set threshold value, the output is
downsampled to reduce the number to the preset value.
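Downsampling to a preset maximum can be sketched in a few lines. This is a minimal illustration assuming uniform random sampling without replacement; the manual does not specify the sampling method MSDM uses, and the function name and `seed` argument are hypothetical.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def downsample(records: Sequence[T], max_count: int,
               seed: Optional[int] = None) -> List[T]:
    """Return at most max_count records, chosen uniformly at random.

    If the input does not exceed max_count, it is returned unchanged.
    The original order of the kept records is preserved.
    """
    if len(records) <= max_count:
        return list(records)
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(records)), max_count))
    return [records[i] for i in keep]


sequences = [f"seq{i}" for i in range(1000)]
subset = downsample(sequences, 100, seed=0)
print(len(subset))  # 100
```

Passing a fixed `seed` makes the selection reproducible, which is useful when comparing runs.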
3.3. Overview of the analysis pipeline
The figure below shows a schematic diagram of the analysis pipeline.
(1) Using read 1 FASTQ files (forward reads in FASTQ format) and read 2 FASTQ files (reverse reads
in FASTQ format), an overlapping sequence of 10 bases or more from the 3′ end is found in order to
generate a merged sequence of each pair of reads. PEAR [3] is used for this step. Only merged
sequences are subjected to the analysis from here on.
Overview of the MSDM Pipeline
(2) The number of merged sequences is randomly downsampled as necessary so as not to exceed the
user-definable maximum number of sequences, and the resulting number of sequences is used for
analysis from here on.
(3) Sequences that do not satisfy a user-definable quality score are filtered out in this step. The
fastq_quality_filter of FASTX-toolkit [4] is used.
(4) Remaining sequences are mapped with Bowtie 2 [5] to the PhiX control sequence used on Illumina
sequencers, and sequences that map to it are filtered out.
(5) Identical sequences are judged to be PCR duplicates, and are filtered out so that only one copy remains.
PRINSEQ [6] is used for this step.
(6) MetaGeneAnnotator [7,8] is used to predict the ORFs for sequences remaining after the above steps.
(7) Translated amino acid sequences are generated from the ORFs predicted in (6).
(8) A custom script (AA_divide) is used to divide the translated amino acid sequences obtained in (7)
into complete ORF amino acid sequences, which begin with a start codon and end with a stop codon,
and partial ORF amino acid sequences. In addition, partial ORF amino acid sequences
shorter than a user-definable threshold are filtered out. The complete ORF amino acid
sequences and remaining partial ORF amino acid sequences are output.
(9) A custom script (randompick) is used to randomly downsample the amino acid sequences output in
(8) as necessary so as not to exceed the user-definable maximum number of sequences, and a
FASTA file of the amino acid sequences is output as the final analysis result.
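The idea behind step (5), removing identical sequences as presumed PCR duplicates, can be sketched as below. This only illustrates exact-duplicate removal in general; it does not reproduce PRINSEQ's implementation or options, and the function name and input format (pairs of read ID and sequence) are assumptions.

```python
from typing import Iterable, List, Set, Tuple


def remove_exact_duplicates(reads: Iterable[Tuple[str, str]]
                            ) -> List[Tuple[str, str]]:
    """Keep the first occurrence of each distinct sequence.

    reads: iterable of (read_id, sequence) pairs.
    """
    seen: Set[str] = set()
    unique: List[Tuple[str, str]] = []
    for read_id, seq in reads:
        if seq not in seen:
            seen.add(seq)
            unique.append((read_id, seq))
    return unique


reads = [("r1", "ACGT"), ("r2", "ACGT"), ("r3", "TTGA")]
print(remove_exact_duplicates(reads))  # [('r1', 'ACGT'), ('r3', 'TTGA')]
```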
4. MSDM server operations
4.1. Starting up the MSDM server
To use the MSDM pipeline, the MSDM server must already be running. Before starting up the MSDM
server after installation, we recommend changing the number of CPUs and amount of memory allocated to
the server to match the operating environment, following the procedure described in Section 4.3.
Follow the procedure below to start up the MSDM server.
(1) Start up VirtualBox, and select the MSDM server from the list of virtual servers in the VirtualBox
Manager window. Check that the server is powered off, and then click the “Start” button. Startup
begins.
(2) After a while, startup ends, and the login screen shown below is displayed on the MSDM server console.
(3) After the server has started up, launch the web browser, and enter “localhost:30080/MSDM” as the
destination URL.
(4) When connection to the server is completed, the main screen of MSDM shown below is displayed on
the browser. MSDM is used via this screen. For details on how to use MSDM, see Section 5.
The figure below shows the screen in an initial state with no jobs registered. Normally, multiple jobs
that have been registered and executed are displayed in the job list in this screen.
When the MSDM server is running, take care to run as few other tasks as possible as the server uses
up a considerable amount of hardware resources.
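Whether the server has finished starting up can also be checked from a script before opening the browser. The following is a minimal sketch using only the Python standard library; the URL is the one given in step (3), while the function name and timeout value are hypothetical.

```python
import urllib.error
import urllib.request

MSDM_URL = "http://localhost:30080/MSDM"  # URL given in step (3)


def is_reachable(url: str = MSDM_URL, timeout: float = 5.0) -> bool:
    """Return True if any HTTP response is received from the URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, although with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, or otherwise not reachable.
        return False


if __name__ == "__main__":
    print("MSDM server reachable:", is_reachable())
```

A result of `False` shortly after clicking “Start” usually just means the virtual server is still booting, so it is reasonable to retry after a short wait.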
4.2. Shutting down the MSDM server
To end use of the MSDM pipeline, shut down the MSDM server. Follow the procedure below to shut
down the MSDM server. Shutting down and then starting up the server again is also an effective remedy,
for example, when an error has occurred during use of the MSDM pipeline and made it inoperable, or when a
problem has occurred, such as disk space not being freed even after unwanted jobs are deleted.
(1) Connect to the MSDM server from the web browser, and check in the Status column that there are no
jobs in a Running status. Note, however, that this cannot be checked when the MSDM pipeline is
inoperable or some other error has occurred.
(2) Select Machine → Close → Power Off from the VirtualBox Manager window menu to shut down the
MSDM server. The MSDM server can also be shut down by clicking the power shutdown button
displayed at the top right of the login screen on the server console.
(3) When the MSDM server shuts down, the server console screen closes. In the VirtualBox Manager
window, make sure that the MSDM server status is “Powered Off”.
This completes the MSDM server shutdown. If necessary, also exit VirtualBox.
4.3. Changing server settings
The execution performance of the MSDM pipeline is strongly dependent on the number of CPUs and
amount of memory assigned to the MSDM server. In the initial state, recommended values are set for both
the number of CPUs and amount of memory. However, these settings can be changed to match the
specifications of the computer you are using. This section describes the procedure for changing these
settings.
(1) Start up VirtualBox, and select the MSDM server from the list of virtual servers in the VirtualBox
Manager window. Check that the MSDM server status is “Powered Off”, and then click the “Settings”
button.
(2) From the server settings screen, select the System icon.
(3) The amount of memory allocated to the server can be changed in the Base Memory field displayed in
the Motherboard panel. After changing to the desired value, click the “OK” button. The new value for
the amount of memory allocated to the server will be reflected when the server is next started up. To
change the number of CPUs, select the Processor tab.
(4) Change the number of CPUs in the Processor(s) field, and click the “OK” button. The new number
of CPUs takes effect the next time the server is started up.
5. Analysis pipeline operations
This chapter describes the various operational procedures of executing and controlling the analysis
pipeline, and checking and acquiring the analysis results. These procedures are performed via the user
interface of the MSDM pipeline.
5.1. Registering new jobs
As preparation before executing analysis, input files and execution parameters, for example, must be
designated, and jobs must be registered. To register a job, click the “New Job” button.
The following dialog for registering a new job is displayed.
To register a job, enter the following information in the dialog, and click the “Upload” button.
(a) Title (required)
Designate the title of the job. The title is used as the prefix of the output file name, so enter an
identifier that does not include any spaces. Preferably, the identifier should allow the user to easily
identify the project name and sample name.
(b) Description (required)
Enter any job-related explanation or comment.
(c) FASTQ file (required)
In each of Forward (Read 1) and Reverse (Read 2), enter the FASTQ file of the Paired-End reads to be
used as inputs. Clicking the “Choose File” button displays the corresponding file selection dialog. Either
FASTQ files or gzip-compressed FASTQ files can be selected. However, the same format of file must be
selected for both Forward and Reverse.
Also, smaller files take less time to upload to the server. For this reason, specify files in
gzip-compressed format where possible. (FASTQ files output from the Illumina sequencer are also in
gzip-compressed format and normally have a name that ends with “fastq.gz”.)
When the “Show” button to the right of the Parameters item is clicked, the parameters setting form is
displayed. The analysis parameters of the registered job can be changed by editing the parameter values in
this form.
If necessary, change the parameters by editing the values in the Value column. Parameters can be reverted
to their default values by clicking the default parameter display in the Default column of a corresponding
line during editing. For an explanation of each parameter, see Section 5.3.
After all required settings are completed, click the “Upload” button to start the file upload. Note that files
exceeding 10 GB in size cannot be uploaded.
Also, note that it will take an excessive amount of time to upload a designated file if it is larger than the
amount of memory on the server. The amount of memory on the server can be checked in the display at the
top right of the job list screen.
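The registration rules in this section (a title without spaces, a required description, matching file formats, and the 10 GB upload limit) can be checked before uploading. The sketch below is not part of MSDM: the function name and messages are hypothetical, the file format is inferred from a “.gz” file name suffix, and “10 GB” is interpreted as 10 GiB, which is an assumption.

```python
from typing import List

MAX_UPLOAD_BYTES = 10 * 1024**3  # assumption: "10 GB" interpreted as 10 GiB


def check_job_inputs(title: str, description: str,
                     forward_name: str, forward_size: int,
                     reverse_name: str, reverse_size: int) -> List[str]:
    """Return a list of problems; an empty list means no problem was found."""
    problems: List[str] = []
    if not title or any(ch.isspace() for ch in title):
        problems.append("Title must be non-empty and contain no spaces.")
    if not description.strip():
        problems.append("Description is required.")
    # Both files must be gzip-compressed, or both must be uncompressed.
    if forward_name.endswith(".gz") != reverse_name.endswith(".gz"):
        problems.append("Forward and Reverse files must use the same format.")
    for label, size in (("Forward", forward_size), ("Reverse", reverse_size)):
        if size > MAX_UPLOAD_BYTES:
            problems.append(f"{label} file exceeds the 10 GB upload limit.")
    return problems
```

In practice the sizes would come from `os.path.getsize()` on the two FASTQ files; they are passed in as numbers here so that the check can be tried without real files.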
When the file upload and job registration are completed, and a job can be run, the new job will appear in
the job list. At this time, “Only uploaded” is displayed in the Status field for the job in the job list.
5.2. Running jobs
When registration is completed, the “Run” button will appear in the Command column of the
corresponding job in the job list.
Click the “Run” command button. The Run(Job) confirmation dialog shown below is displayed.
To check and change the analysis parameters again before running the job, click the “Show” button to the
right of the Parameters item.
The current settings and default values of each parameter are displayed. Parameters can be changed by
editing the values in the Value column. For an explanation of each parameter, see Section 5.3.
To run a job, click the “Run” button. The Status field in the job list then changes to “Running...”.
5.3. Setting and changing analysis parameters
The following table summarizes the analysis parameters that can be set and explains each of the
parameters.
Command               Parameter  Description
pear                  -j         Number of threads used by the PEAR command.
                                 Do not set a value larger than the number of
                                 CPUs on the server.
pear                  lenMin     Minimum sequence length after merging.
                                 Sequences shorter than this are filtered out.
pear                  #seqMax    Maximum number of sequences after merging.
                                 Sequences are randomly downsampled when the
                                 number of sequences exceeds this value.
fastq_quality_filter  -p         Minimum percentage of bases that must have a
                                 quality score of at least the value designated
                                 by -q. Sequences that do not meet this
                                 percentage are filtered out.
fastq_quality_filter  -q         Minimum quality score.
AA_divide             aalen      Minimum length of partial ORF amino acid
                                 sequences. Partial ORF amino acid sequences
                                 shorter than this are filtered out.
randompick            #seq       Maximum number of amino acid sequences that
                                 are output. Amino acid sequences are randomly
                                 downsampled when the number of sequences
                                 exceeds this value.
These parameters can be set when registering a job, running a job, or re-running a job.
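How the -q and -p parameters of fastq_quality_filter work together can be illustrated with a short sketch. This is not the FASTX-toolkit implementation: the function name is hypothetical, and the quality line is assumed to use Phred+33 encoding.

```python
def passes_quality_filter(quality_line: str, min_quality: int,
                          min_percent: float, offset: int = 33) -> bool:
    """Return True if the read would be kept.

    A read is kept when at least min_percent percent of its bases have a
    quality score of min_quality or higher (the roles of -p and -q).
    offset is the ASCII offset of the quality encoding (assumed Phred+33).
    """
    scores = [ord(ch) - offset for ch in quality_line]
    if not scores:
        return False
    good = sum(1 for score in scores if score >= min_quality)
    return 100.0 * good / len(scores) >= min_percent


# 'I' encodes a score of 40 and '!' a score of 0 in Phred+33.
print(passes_quality_filter("IIII", 20, 80))  # True: 100% of bases reach Q20
print(passes_quality_filter("II!!", 20, 80))  # False: only 50% reach Q20
```

Raising either value makes the filter stricter: a higher -q demands better bases, and a higher -p demands that more of the read meets that standard.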
5.4. Displaying the job list
The job list is displayed in the main screen of this system. Jobs are listed in order of upload,
starting with the most recently completed upload.
The top right of the list displays the amount of storage used and maximum storage of the MSDM
server and the amount of memory used and allocated as of the last time the screen was refreshed.
Take appropriate measures such as limiting running of jobs and deleting analysis data to prevent an
excessive load from being placed on the server.
The following table summarizes the items displayed in the job list.
No.  Item name         Description
1    ID                Automatically assigned job ID
2    Title             Title entered by the user at job registration
3    Description       Description entered by the user at job registration
4    Input reads       File name of each read data file entered by the user
                       at job registration
5    Period            The first row displays the job start time, and the
                       second row displays the job completion time.
6    Run time [sec]    Time taken to run the job. If a job is running, the
                       time elapsed since the start of the job is displayed.
7    Status            Displays the job status. The job status is one of the
                       following.
                       • Only uploaded: File upload completed
                       • Running...: Job currently running
                       • Terminated: Either the job was stopped midway or a
                         server shutdown stopped the job
                       • Complete: Job ended normally
                       • Error: Job was run but ended in error
8    Parameters        Button for displaying the analysis parameters
9    Statistics        Sequentially displays the statistical information of
                       each step in the analysis pipeline according to the
                       state of progress. The number of reads/number of
                       sequences and base length of nucleic acid sequences,
                       the number of amino acid sequences and sequence
                       lengths, and other information are displayed.
10   Result            Displays the button for downloading the amino acid
                       sequences in the analysis result after analysis ends
                       normally.
11   Sequence length   Merged reads: Button for displaying a histogram of
     distribution      the sequence lengths after merging of read 1 and
                       read 2
                       Result: Button for displaying a histogram of the
                       sequence lengths of the amino acid sequences in the
                       final analysis result
12   Command           Displays the various command execution buttons for
                       jobs.
                       Run: Runs a job
                       Rerun: Runs a job again (analysis result is
                       overwritten)
                       Terminate: Terminates a running job
                       Delete: Deletes a job (including accompanying data)
5.5. Refreshing the job list
To refresh the display of the job list to its latest state, click either the “Reload” button at the top of
the job list or the system title at the top left of the screen.
When there are still jobs in the job list whose Status is indicated as “Running...”, the job list can also be
refreshed by clicking the “Running...” indication.
5.6. Checking analysis status
The analysis status (analysis parameters, various statistical information, etc.) of jobs that are currently
being analyzed or jobs that have been analyzed can be checked at any time.
(a) Checking analysis parameters
To check analysis parameters, click the “Show” button in the Parameters field of the desired job in the
job list. A popup window containing the analysis parameters is displayed. The popup window can be
moved by dragging it. To close the popup window, click the “x” button at the top left.
(b) Checking statistical information at each analysis step
To check the statistical information (number of reads/number of sequences, read length/sequence
length, etc.) at each analysis step, click the “Detail” button in the Statistics column of the desired job.
The statistical information of steps that have been analyzed so far for a job currently being analyzed
can be checked.
(c) Checking the distribution of sequence lengths of merged reads
When merging of read 1 and read 2 by the PEAR command is completed, a histogram comprising the
lengths of the merged sequences can be displayed by clicking the “Merged reads” button in the Sequence
length distribution field. The histogram can be moved by dragging it, and can be closed by clicking
inside the histogram.
Completion of the analysis by the PEAR command can be checked by clicking the “Detail” button in
the Statistics field. If completed, the statistical information of pear (limited) is displayed.
The X-axis of the histogram shows the length of the merged sequences, and the Y-axis shows the number
of sequences. The blue line shows the histogram of all sequences that were merged by the PEAR command,
and the red line shows the histogram of the sequences remaining after filtering and downsampling according to the
threshold values of sequence length and number of sequences. We recommend using this histogram to check
whether the distribution of sequence lengths is consistent with the expected library size distribution.
If an abnormality is found at this stage, it may be necessary to consider reviewing analysis parameters or
redoing the sequencing, for example. For details on terminating or re-analyzing currently running jobs, see
Section 5.8.
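A sequence length distribution like the one described in (c) can also be tabulated outside MSDM, for example to compare against the expected library size. The sketch below is a generic illustration; the function name and the bin width are hypothetical and do not reflect how MSDM bins its histogram.

```python
from collections import Counter
from typing import Dict, Iterable


def length_histogram(sequences: Iterable[str],
                     bin_width: int = 10) -> Dict[int, int]:
    """Count sequences per length bin.

    Keys are the lower bound of each bin (0, bin_width, 2 * bin_width, ...),
    and values are the number of sequences whose length falls in that bin.
    """
    counts = Counter((len(seq) // bin_width) * bin_width for seq in sequences)
    return dict(sorted(counts.items()))


merged = ["A" * 5, "A" * 12, "A" * 19, "A" * 25]
print(length_histogram(merged))  # {0: 1, 10: 2, 20: 1}
```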
(d) Checking the distribution of amino acid sequence lengths in the analysis result
When analysis ends normally, the “Result” button is displayed in the Sequence length distribution field. A
histogram of the amino acid sequence lengths in the final analysis result can be checked by clicking this
button. The X-axis of the histogram shows the length of the amino acid sequences, and the Y-axis shows the
number of sequences. A greater number of long sequences indicates higher reliability of MAPLE analysis.
5.7. Downloading analysis results
When a job ends normally, the “FASTA” button will appear in the Result column.
Clicking the “FASTA” button downloads the FASTA file of the amino acid sequence. This file is the
MAPLE input file. Before running analysis with MAPLE, we recommend displaying and checking the
various statistical information by the methods described in Section 5.6, and checking whether the analysis
result is appropriate.
5.8. Terminating an analysis
To terminate a running job, click the “Terminate” button in the Command field of the currently running
job.
The job termination confirmation dialog is displayed. Select “OK”.
When a job is terminated, the Status field in the job list changes to “Terminated”. In this state, the job
can be run again from the beginning by clicking the “Rerun” button in the Command field.
5.9. Deleting analysis results
To free up space on the server disk, analysis results and input data must be deleted. To do this, click the
“Delete” button of the job to be deleted in the Command field in the job list.
The data deletion confirmation dialog is displayed. Select “OK”.
Once data are deleted, the corresponding job disappears from the job list.
References
[1] Arai, W., Taniguchi, T., Goto, S., Moriya, Y., Uehara, H., Takemoto, K., et al. (2018). MAPLE 2.3.0: an improved system for evaluating the functionomes of genomes and metagenomes. Bioscience, Biotechnology, and Biochemistry, 82(9), 1515–1517.
[2] Takami, H., Taniguchi, T., Arai, W., Takemoto, K., Moriya, Y., and Goto, S. (2016). An automated system for evaluation of the potential functionome: MAPLE version 2.1.0. DNA Research, 23(5), 467–475.
[3] Zhang, J., Kobert, K., Flouri, T., and Stamatakis, A. (2014). PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics, 30(5), 614–620.
[4] http://hannonlab.cshl.edu/fastx_toolkit/
[5] Langmead, B., and Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359.
[6] Schmieder, R., and Edwards, R. (2011). Quality control and preprocessing of metagenomic datasets. Bioinformatics, 27(6), 863–864.
[7] Noguchi, H., Taniguchi, T., and Itoh, T. (2008). MetaGeneAnnotator: Detecting Species-Specific Patterns of Ribosomal Binding Site for Precise Gene Prediction in Anonymous Prokaryotic and Phage Genomes. DNA Research, 15(6), 387–396.
[8] Noguchi, H., Park, J., and Takagi, T. (2006). MetaGene: prokaryotic gene finding from environmental genome shotgun sequences. Nucleic Acids Research, 34(19), 5623–5630.