Virus Detect for Windows Manual - CGIAR · 2019. 11. 20. · Virus Detect for Windows (VDW) is a...

Virus Detect for Windows

Manual

1. Virus Detect for Windows
2. Requirements
3. Installation
4. Databases
5. Host reference
6. Control sequence
7. Analysis
8. Results
9. Troubleshooting


1. Virus Detect for Windows

Virus Detect for Windows (VDW) is a portable toolkit for the local and automated analysis of virus infection in different organisms on the Windows OS platform. VDW was developed as a Windows-compatible interface on top of Virus Detect Linux (Zheng, 2017). VDW also adds new features, such as control sequence and spike sequence analysis, as part of the automated pipeline.

2. Requirements

• Windows 10, 64-bit
• Java 1.8

3. Installation

Virus Detect for Windows can be downloaded from https://research.cip.cgiar.org/virusdetect/

Copy the downloaded zip file to a folder of your choice (preferably on drive D:), then extract the file in that folder. Open the VDW folder, where you will find the executable VDW.exe. Double-click it to open the program. You can also create a Desktop shortcut: right-click VDW.exe and select Send to -> Desktop (create shortcut).

4. Databases

VDW uses a reference sequence database for the reference-guided assembly of reads and for identifying viruses from assembled contigs. Before starting any analysis, get the databases in place. VDW offers the option to download or update precompiled databases from the VirusDetect ftp site, or to create a custom database for your specific purposes.

a. Download/update NCBI database:

This option enables you to download classified and non-redundant databases available on the VirusDetect ftp site. This virus reference database is generated from the GenBank virus database (gbvrl) and is updated on an occasional basis. Subdividing the virus sequences by host reduces the size of each database and increases the speed of analysis.


Current local NCBI Database version installed on this computer: Displays the database versions installed on this computer; you do not need to fill in this field. If it is blank, no database has been installed on this computer yet. Compare the current local database version with the available NCBI database version before you update the NCBI database.

Update to NCBI Database version: The latest precompiled version available is displayed automatically. This field does not need to be filled in. Note: If it is blank, make sure the “perlfiles” folder is present and has not been modified.

Select virus host Organism: The virus sequence databases have been classified into different kingdoms, including plant, vertebrate, invertebrate, fungus, bacteria, algae, archaea and protozoa, using a Virus Classification Pipeline. Choose according to your analysis.

Select level of redundancy reduction for virus sequences (% identity): Unique virus sequence databases were generated for each host kingdom by removing redundant sequences at 100%, 97% and 95% identity, respectively. Reducing the redundancy level reduces the database size, which increases the speed of analysis and also affects the complexity of the results.


Download/update NCBI database

b. Updating local database:

This option enables you to create your own custom virus sequence database, so that you can focus on the viruses of specific interest to you. To load a custom virus sequence database into VDW, four files must first be created:
1) A nucleotide FASTA file, with extension .fa or .fasta
2) A protein FASTA file containing the translated protein sequences from the nucleotide FASTA file in the same order, with extension .fa or .fasta
3) An ID file containing, on each line, the ID of the nucleotide sequence and the ID of its translated protein sequence in two separate columns
4) A sequence information file with seven columns for each sequence: sequence ID, sequence length, genus, virus name, number, host type, source of classification
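A quick consistency check before loading can catch mismatched files. The following is a minimal sketch, not part of VDW: it assumes that the ID file and the sequence information file are tab-separated and that a FASTA ID is the first token of each header line, neither of which is stated in this manual. The function names are hypothetical.

```python
def fasta_ids(lines):
    """Return sequence IDs (first token after '>') in file order."""
    return [ln[1:].split()[0] for ln in lines
            if ln.startswith(">") and ln[1:].strip()]


def check_custom_db(nucl_lines, prot_lines, id_lines, info_lines):
    """Return a list of problems; an empty list means the files look consistent.

    Assumption: ID and information files are tab-separated.
    """
    problems = []
    nucl_ids = set(fasta_ids(nucl_lines))
    prot_ids = set(fasta_ids(prot_lines))

    for n, line in enumerate(id_lines, 1):
        if not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 2:
            problems.append(f"ID file line {n}: expected 2 columns, found {len(cols)}")
            continue
        if cols[0] not in nucl_ids:
            problems.append(f"ID file line {n}: {cols[0]} not in nucleotide FASTA")
        if cols[1] not in prot_ids:
            problems.append(f"ID file line {n}: {cols[1]} not in protein FASTA")

    for n, line in enumerate(info_lines, 1):
        if not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 7:
            problems.append(f"Info file line {n}: expected 7 columns, found {len(cols)}")
        elif cols[0] not in nucl_ids:
            problems.append(f"Info file line {n}: {cols[0]} not in nucleotide FASTA")
    return problems
```

Each argument can be any iterable of lines, so open file objects can be passed directly.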


Load custom database files

Once these files are ready, choose local DB from the configuration menu and submit the files as shown in the figure.

Load custom database


5. Host reference

VirusDetect can use a sequence of the virus host to subtract reads with homology to the host, thus removing host-derived sequences and reducing the number of sequences for assembly and virus identification. This can speed up the analysis, remove nonspecific hits caused by genome-integrated viral sequences and increase the chances of identifying foreign sequences among “undetermined contigs”. The host reference should have a fasta or gz file extension. You can upload different host references for different analyses.

Upload host reference

6. Control sequences

Cross-sample contamination can be a problem when running several indexed samples in the same lane of a flow cell. To control for cross-sample contamination and adjust the default threshold of average read depth and coverage for virus detection by VDW, one can include a control sample containing a known virus at high concentration and unrelated to the sample under investigation. Upload your control sequence in FASTA file format.
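To make the two quantities concrete, coverage and average read depth of a control genome can be computed from read alignment positions. This is a minimal sketch using the plain definitions (fraction of positions covered, mean reads per position); it assumes alignments are given as 0-based (start, length) pairs and does not reproduce any normalization VDW may apply. The function name is hypothetical.

```python
def coverage_and_depth(ref_length, alignments):
    """Return (coverage fraction, mean depth) for a reference of ref_length.

    alignments: iterable of (start, length) pairs, 0-based (an assumption).
    """
    depth = [0] * ref_length
    for start, length in alignments:
        # Clip each alignment to the reference boundaries.
        for pos in range(max(start, 0), min(start + length, ref_length)):
            depth[pos] += 1
    covered = sum(1 for d in depth if d > 0)
    return covered / ref_length, sum(depth) / ref_length


# Two reads on a 10-nt reference: positions 0-4 and 3-6.
coverage, mean_depth = coverage_and_depth(10, [(0, 5), (3, 4)])
print(coverage, mean_depth)
```

A control virus found in samples that should not contain it, at a coverage and depth above what the thresholds allow, points to cross-sample contamination.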

7. Analysis

Analysis parameters description:


Project folder name: Provide a name for the folder in which the results of this run will be saved. Enter an alphanumeric name without spaces or special characters.

Fastq files: Select one or more fastq files; compressed gz files are also accepted. All selected files will be saved in the run folder given in the previous option. Sequences can be pre-cleaned and trimmed with other software, or raw and uncleaned. In the latter case you will need to provide parameters for “Sequence of the adaptor” and “Minimum length of sequence after trimming”. In all cases, FastQC quality check files will be generated for the input files and the cleaned read files. Additional optional analyses include spike-in sequence analysis.

Sequence cleaning & trimming (optional in case of raw reads). If raw reads are provided, VDW can clean and trim the sequence libraries. You will have to provide the sequencing adaptor and, optionally, the minimum and maximum sequence length after trimming.

Sequences of the adaptor. Provide the sequence of the library adaptor. Commonly used adaptor sequences are provided in the VDW user manual. Ex. CAGATCGGAAGAGCACA

Minimum length of sequence after trimming. A minimum (and maximum) length for sequences after trimming can be provided. This is useful when you want to consider only true siRNAs (21-24 nt) or avoid extremely short sequences in your analysis, reducing read numbers and analysis time.
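The effect of the adaptor and length settings can be illustrated with a small sketch. It assumes a 3' adaptor found by exact match only; real trimming tools also handle mismatches and partial adaptors, and this manual does not say which tool VDW uses internally. The function name is hypothetical.

```python
def trim_read(seq, adaptor, min_len=None, max_len=None):
    """Cut the read at the first exact adaptor match.

    Returns the trimmed read, or None if it fails the length filter.
    """
    idx = seq.find(adaptor)
    trimmed = seq[:idx] if idx != -1 else seq
    if min_len is not None and len(trimmed) < min_len:
        return None
    if max_len is not None and len(trimmed) > max_len:
        return None
    return trimmed


ADAPTOR = "CAGATCGGAAGAGCACA"  # example adaptor from this manual
read = "TGAGGTAGTAGGTTGTATAGT" + ADAPTOR + "TTT"  # 21-nt insert plus adaptor
print(trim_read(read, ADAPTOR, min_len=21, max_len=24))
```

With `min_len=21` and `max_len=24`, only reads in the siRNA size range survive; shorter or longer inserts are dropped.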

Library quality control (optional if controls are included)

Enter spike sequences. Synthetic RNA spike-in sequences can be added to samples before RNA extraction (recommended, to control the complete library preparation process) or after it. Recovery of spike-in sequences among the reads of your sample library provides an indication of the efficiency of library preparation and enables absolute quantification of (small) RNA amounts in the sample. Example:

ATGGAGCCAGTTC
GGACTCATTACGG
GCTTGCCGATGAA
AGCACTCTGGGAT
ATGCTGGACCATG
AAGCCTGCGTATG
GAGCGTCCGATAT
CCGGGATCGTTAA
CGATATGCCTGGACG
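Spike recovery can be pictured as counting the reads that contain each spike sequence. The sketch below assumes exact substring matching on the forward strand; how VDW actually matches spikes is not described in this manual, and the function name is hypothetical.

```python
def count_spikes(reads, spikes):
    """Count, for each spike sequence, the reads that contain it exactly."""
    counts = {spike: 0 for spike in spikes}
    for read in reads:
        for spike in spikes:
            if spike in read:
                counts[spike] += 1
    return counts


spikes = ["ATGGAGCCAGTTC", "GGACTCATTACGG"]  # first two example spikes above
reads = ["TTATGGAGCCAGTTCAA", "GGACTCATTACGG", "ACGTACGTACGT"]
print(count_spikes(reads, spikes))
```

Comparing the counts across spikes added in known amounts gives a rough picture of library preparation efficiency.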

Select control sequence. Cross-sample contamination can be a problem when running several indexed samples in the same lane of a flow cell. To control for cross-sample contamination and adjust the default threshold of average read depth and coverage for virus detection by VDW, one can include a control sample containing a known virus at high concentration and unrelated to the sample under investigation. If included, VDW will provide a separate output on genome coverage and sequencing depth of the control over all samples selected for the run.

Define control sample. Here you can enter the name of the sample containing the control.

Virus database (required): Select one of the virus sequence databases you have previously loaded into VDW in the ‘Config’ section 4.

Host sequence (Optional): A host reference sequence database, if available and loaded into VDW in the ‘Config’ section 5, can be used to subtract RNA reads derived from the host, thus reducing the number of non-relevant reads for virus detection and consequently the computing time. This can speed up the analysis, remove nonspecific hits caused by genome-integrated viral sequences and increase the chances of identifying foreign sequences among “undetermined contigs”.

Number of threads (Optional): The number of processor threads used to run the software. The more threads provided, the more parallel processes the software can run and the faster it will run, particularly if many samples have been provided as input simultaneously. The number of threads you can assign depends on the number of CPUs in your computer and the number of cores on each CPU. Default 1.

Additional parameters (Optional): VirusDetect runs with default parameters, which have been determined to be optimal over a large range of samples to ensure maximum accuracy and avoid false positive discovery due to cross-sample contamination. However, variation may exist between samples from different hosts, library preparation approaches and sequencing platforms, or in the level of sensitivity individual scientists require. Thus, VDW enables some of the key parameters to be adjusted according to the user’s needs. These parameters and their default values are listed in Annex 1 and can be adjusted by copying the parameter modifier and providing the adjusted value. Parameters must be separated by a space. Example: --word_size 13 --min_overlap 20
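The additional parameters string is a space-separated list of modifier and value pairs. The sketch below shows one way to check such a string before a run. It is an illustration, not VDW code: the `DEFAULTS` table holds only a subset of Annex 1, and the function name is hypothetical.

```python
import shlex

# Subset of the defaults listed in Annex 1 (illustrative, not complete).
DEFAULTS = {"word_size": 11, "min_overlap": 30, "exp_value": 1e-5, "depth_cutoff": 5.0}


def parse_additional_parameters(text, defaults=DEFAULTS):
    """Parse '--name value' pairs and return the defaults with overrides applied."""
    tokens = shlex.split(text)
    if len(tokens) % 2 != 0:
        raise ValueError("each parameter needs exactly one value")
    params = dict(defaults)
    for flag, value in zip(tokens[0::2], tokens[1::2]):
        if not flag.startswith("--"):
            raise ValueError(f"expected a parameter modifier, found {flag!r}")
        name = flag[2:]
        if name not in defaults:
            raise ValueError(f"unknown parameter: {flag}")
        # Convert the value to the type of the default (int or float).
        params[name] = type(defaults[name])(value)
    return params


print(parse_additional_parameters("--word_size 13 --min_overlap 20"))
```

Parameters that are not mentioned keep their default values, which mirrors how the form is described above.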


Analysis form


8. Results

Results are found in the results folder inside the VirusDetectWin folder. The results are the same as those of VirusDetect Linux; see http://virusdetect.feilab.net/cgi-bin/virusdetect/vd_help.cgi

Blastn/Blastx results:

Additional QC results: VDW produces some additional results, depending on the analysis performed.

FastQC results: FastQC is performed before and after sequence trimming.


Sequence read length distribution:

Cleaning statistics:


Spike sequences abundance:

Summary for sequence cleaning, quality control and control sequences:

The summary for sequence cleaning, quality control and control sequences is reported in the file Summary.txt. This file contains the following columns:

• File name
• #Raw reads
• #clean reads
• #21 reads
• #22 reads
• #23 reads
• #24 reads
• Spike1
• Spike2
• Spike n
• Control coverage
• Normalized control depth
• Normalized depth/kb control coverage
• # Reads mapped to control
• % Reads mapped to control


9. Troubleshooting

Do not include spaces in folder or file names. Do not delete any file outside the results folder.
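Name problems can be checked before starting a run. The following sketch is not part of VDW; it tests a project folder name against the "alphanumeric, no spaces or special characters" rule and looks for spaces anywhere in a Windows path. Both function names are hypothetical.

```python
import re
from pathlib import PureWindowsPath


def is_safe_project_name(name):
    """True if the name contains only letters and digits."""
    return re.fullmatch(r"[A-Za-z0-9]+", name) is not None


def path_has_spaces(path):
    """True if any folder or file name in a Windows path contains a space."""
    return any(" " in part for part in PureWindowsPath(path).parts)


print(is_safe_project_name("Run01"))                   # True
print(path_has_spaces(r"D:\VDW\my data\reads.fastq"))  # True
```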


Annex 1. Additional parameters from Virus Detect (Zheng, 2017):

BWA alignment parameters (alignments of reads to reference viruses or host sequences)
--max_dist [Integer] Maximum edit distance [1]
--max_open [Integer] Maximum number of gap opens [1]
--max_extension [Integer] Maximum number of gap extensions [1]
--len_seed [Integer] Seed length [15]
--dist_seed [Integer] Maximum edit distance in the seed [1]

HISAT options (align RNA-Seq reads to host references)
--hisat_dist [Integer] Maximum edit distance for HISAT [5]

Blast alignment options (remove redundancy within virus contigs)
--min_overlap [Integer] Minimum overlap length [30]
--max_end_clip [Integer] Maximum length of end clips [6]
--min_identify [Float] Minimum percent identity [97]
--mis_penalty [Integer] Penalty score for a nucleotide mismatch [-3]
--gap_cost [Integer] Cost to open a gap [-1]
--gap_extension [Integer] Cost to extend a gap [-1]

Blast alignment options (align virus contigs to virus reference database)
--word_size [Integer] Minimum word size [11]
--exp_value [Float] Maximum e-value [1e-5]
--identity_percen [Float] Minimum percentage identity [25]
--mis_penalty_b [Integer] Penalty score for a nucleotide mismatch [-3]
--gap_cost_b [Integer] Cost to open a gap [-1]
--gap_extension_b [Integer] Cost to extend a gap [-1]

Result filter options
--hsp_cover [Float] Coverage cutoff of a reported virus contig by reference virus sequences [0.75]
--coverage_cutoff [Float] Coverage cutoff of a reported virus reference sequences by assembled virus contigs [0.1]
--depth_cutoff [Float] Depth cutoff of a reported virus reference [5]
--siRNA_precent [Float] Proportion cutoff of 21-nt and 22-nt siRNAs for viral-like contigs [0.5]
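As an illustration of how the result filter options interact, the sketch below keeps a candidate virus reference only when both its coverage and its depth reach the cutoffs. The defaults are the Annex 1 values for --coverage_cutoff and --depth_cutoff; the "greater than or equal" comparison, the function name and the candidate names are assumptions made for this example, not VirusDetect's documented behaviour.

```python
def passes_filters(coverage, depth, coverage_cutoff=0.1, depth_cutoff=5.0):
    """True if a reference meets both cutoffs (comparison rule is an assumption)."""
    return coverage >= coverage_cutoff and depth >= depth_cutoff


# Hypothetical candidates: (reference name, coverage fraction, depth).
candidates = [
    ("refA", 0.85, 120.0),
    ("refB", 0.05, 300.0),  # deep but covers too little of the reference
    ("refC", 0.40, 2.0),    # well covered but too shallow
]
reported = [name for name, cov, dep in candidates if passes_filters(cov, dep)]
print(reported)
```

Raising --depth_cutoff is the kind of adjustment the control sample is meant to inform when low-level cross-sample contamination is present.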