De-multiplexing & Quality Control Challenges and Solutions · –Translate base calls (.bcl files)...
De-multiplexing & Quality Control Challenges and Solutions
Sridhar Srinivasan
Bioinformatician
Premas Lifescience
Abstract
• Illumina offers multiple sequencing platforms
– that produce large amounts of high-quality sequence data
in a short time frame.
• To use the full capacity of a sequencing run, we
generally multiplex many samples into a single run.
• This has to be followed by demultiplexing and
quality control of the data to get reliable and
reproducible results.
Our Discussion Here Is About
• The tools and strategies for obtaining good demultiplexed
data.
• How to check various quality parameters of
sequencing data, and methods to remove
low-quality or contaminated reads.
Data analysis
[Figure: analysis workflow. Images → Intensities → Basecalls (Bcl files, C/A/G/T), produced by Instrument Control Software/RTA; Reads → Alignments → Polymorphisms, produced by CASAVA 1.8 / MSR / 3rd party tools; then Visualize → Biological results.]
CASAVA
• CASAVA is a Linux application designed to:
– Translate base calls (.bcl files) to compressed,
demultiplexed FASTQ files
– Align reads
– Call variants (SNPs and indels)
– Assign genotypes to variants
– Count expression levels for exons, genes and splice
junctions in RNA-seq runs
Demultiplexing overview
• Demultiplexing can be done by:
– CASAVA 1.8.2
– MiSeq Reporter software (for the MiSeq)
• Demultiplexing requires a run folder (with bcl files) and a sample sheet
• Demultiplexing occurs during BCL to FASTQ processing
• Each index read is compared to the index
sequences specified in the sample sheet
• No quality values are considered in this step
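The comparison step above can be sketched in a few lines of Python. This is an illustrative sketch only, not CASAVA's implementation: the sample names, the second index sequence, and the `max_mismatches` parameter are assumptions made for the example. As on the slide, quality values play no role.

```python
from typing import Dict, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("index sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def assign_sample(index_read: str, sheet: Dict[str, str],
                  max_mismatches: int = 0) -> Optional[str]:
    """Return the sample whose index matches the index read, else None.

    Only the base calls are compared; quality values are ignored.
    An ambiguous read (more than one matching sample) is left unassigned.
    """
    hits = [sample for sample, index in sheet.items()
            if hamming(index_read, index) <= max_mismatches]
    return hits[0] if len(hits) == 1 else None


# Hypothetical sample sheet content: sample name -> index sequence.
sheet = {"SampleA": "CTTGTA", "SampleB": "GCCAAT"}

print(assign_sample("CTTGTA", sheet))                    # exact match
print(assign_sample("CTTGTT", sheet))                    # one mismatch, none allowed
print(assign_sample("CTTGTT", sheet, max_mismatches=1))  # one mismatch allowed
```

Reads for which the function returns None correspond to the reads that end up in the Undetermined_Indices directory.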
How does demultiplexing occur?
• Illumina sequencing instruments generate *.bcl files as primary
sequencing output.
• CASAVA contains a BCL to FASTQ
converter (configureBclToFastq.pl) that combines these per-cycle
*.bcl files from a run and translates them into FASTQ files.
• In addition to generating FASTQ files, CASAVA uses a user-created
or IEM sample sheet to divide the run output into projects and
samples, and stores these in separate directories.
• If no sample sheet is provided, all reads are put in the
Undetermined_Indices directory by lane and are not demultiplexed.
SampleSheet.csv
Header         Description
FCID           Flow cell ID
Lane           Positive integer indicating lane number (1-8)
SampleID       ID of sample
SampleRef      The reference sequence to be used for the sample
Index          Index sequence
Description    Description of the sample
Control        Y indicates the lane is a control lane; N means sample
Recipe         Recipe used for sequencing
Operator       Name or ID of operator
SampleProject  The project the sample belongs to
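A sample sheet with these columns can be read with Python's standard csv module. The sketch below assumes the CASAVA 1.8 column order FCID, Lane, SampleID, SampleRef, Index, Description, Control, Recipe, Operator, SampleProject; every value in the example rows (sample names, reference, recipe, operator, project) is made up for illustration.

```python
import csv
import io
from typing import Dict, List

# Hypothetical sample sheet content; all values are illustrative only.
SAMPLE_SHEET = """FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
B809UWABXX,1,SampleA,hg19,CTTGTA,example sample,N,R1,operator1,ProjectX
B809UWABXX,1,SampleB,hg19,GCCAAT,example sample,N,R1,operator1,ProjectX
"""


def read_sample_sheet(handle) -> List[Dict[str, str]]:
    """Read the sample sheet rows and check that Lane is an integer in 1-8."""
    rows = list(csv.DictReader(handle))
    for row in rows:
        lane = int(row["Lane"])
        if not 1 <= lane <= 8:
            raise ValueError(f"lane out of range: {lane}")
    return rows


def indexes_by_lane(rows: List[Dict[str, str]]) -> Dict[int, Dict[str, str]]:
    """Group the sheet as lane -> {index sequence: sample ID}."""
    lanes: Dict[int, Dict[str, str]] = {}
    for row in rows:
        lanes.setdefault(int(row["Lane"]), {})[row["Index"]] = row["SampleID"]
    return lanes


rows = read_sample_sheet(io.StringIO(SAMPLE_SHEET))
print(indexes_by_lane(rows))
```

Grouping by lane first reflects that the same index sequence may be reused in different lanes of a flow cell.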
Input Files for configureBclToFastq.pl
• Run Folder (from RTA or
OLB)
– Files actually required
are in the graphic shown
• SampleSheet.csv
– User created (Microsoft
Excel is easiest)
– Saved in *.csv format
– Default location is the
BaseCalls directory
BCL Conversion and Demultiplexing Invocation
• Create MakeFiles
– Builds the run folder structure and generates the MakeFiles
• cd into the Analysis Directory
– MakeFiles are created in the analysis directory
• Execute MakeFiles
– Start the BCL conversion and Demultiplexing run
nohup keeps the process running even if the session is interrupted or you log out.
The -j option specifies the extent of parallelization.
/path/to/CASAVA/bin/configureBclToFastq.pl --input-dir <BaseCalls_DIR> \
--output-dir <Unaligned> --sample-sheet <Input DIR>/SampleSheet.csv
cd /path/to/RunFolder/Unaligned
nohup make -j <n> &
Command(s)
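For scripted runs, the two steps can be assembled as argument lists before anything is executed. The following is a dry-run sketch under stated assumptions: the function name and all paths are hypothetical, only the three configureBclToFastq.pl options shown on the slide are used, and GNU make's `-C` option stands in for the `cd` step. It prints the commands rather than running them, and it leaves out nohup.

```python
import shlex
from typing import List


def bcl_to_fastq_commands(casava_bin: str, basecalls_dir: str, output_dir: str,
                          sample_sheet: str, jobs: int) -> List[List[str]]:
    """Build the configure and make commands as argument lists (nothing is run)."""
    configure = [
        f"{casava_bin}/configureBclToFastq.pl",
        "--input-dir", basecalls_dir,
        "--output-dir", output_dir,
        "--sample-sheet", sample_sheet,
    ]
    # "make -C <dir>" changes into the directory before reading the Makefile,
    # which replaces the separate "cd" step.
    make = ["make", "-C", output_dir, "-j", str(jobs)]
    return [configure, make]


# Hypothetical paths, for illustration only.
commands = bcl_to_fastq_commands(
    casava_bin="/path/to/CASAVA/bin",
    basecalls_dir="/path/to/RunFolder/Data/Intensities/BaseCalls",
    output_dir="/path/to/RunFolder/Unaligned",
    sample_sheet="/path/to/RunFolder/SampleSheet.csv",
    jobs=4,
)
for command in commands:
    print(shlex.join(command))
```

Keeping the commands as lists makes it straightforward to pass them to `subprocess.run` later without shell quoting problems.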
BCL conversion and demultiplexing options
• Selected command line options
Demultiplexing output fastq file
• The FASTQ files are located in the
Unaligned/Project_<ProjectName>/Sample_<SampleName> directories
• Illumina FASTQ files use the following naming scheme:
<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz
• In the case of non-multiplexed runs, <sample name> will
be replaced with the lane numbers (lane1, lane2, ...,
lane8) and <barcode sequence> will be replaced with
"NoIndex".
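The naming scheme can be taken apart with a regular expression, which is handy when collecting files per sample or per lane. This is a minimal sketch: the pattern is written from the scheme as described on the slide, and the file name `SampleA_CTTGTA_L001_R1_001.fastq.gz` is a made-up example.

```python
import re
from typing import Dict, Optional

# <sample name>_<barcode sequence>_L<lane>_R<read number>_<set number>.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[^_]+)_L(?P<lane>\d{3})"
    r"_R(?P<read>\d+)_(?P<set>\d{3})\.fastq\.gz$"
)


def parse_fastq_name(filename: str) -> Optional[Dict[str, object]]:
    """Split a FASTQ file name into its fields, or return None if it does not fit."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return None
    return {
        "sample": match.group("sample"),
        "barcode": match.group("barcode"),
        "lane": int(match.group("lane")),
        "read": int(match.group("read")),
        "set": int(match.group("set")),
    }


print(parse_fastq_name("SampleA_CTTGTA_L001_R1_001.fastq.gz"))
print(parse_fastq_name("lane1_NoIndex_L001_R1_001.fastq.gz"))
```

Because the sample name is matched greedily and the remaining fields are anchored at the end, sample names that themselves contain underscores are still parsed correctly.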
Demultiplexing Output Files, FastQ File
@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA
TGAAACCAGTGTTCTTAATTGGCATTTTACACACACACACACAGAATTTAAAAAAAAAATCAAAGGAAATCATTCTAAATGTACTATGATAGCATGTTAAA
+
=55>7;?::BDADDD@EE88DCD?DFFEFFECBE6666BB=B;<;<-34:;<CB51>=BBEE>EE?3D@??CB->:=:AA8DDDDDDBBE9;,=?:/89<E
Character  ASCII value  Phred score  Error probability
5          53           20           0.01
?          63           30           0.001
I          73           40           0.0001
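The three table rows follow from two small formulas: the Phred score is the ASCII value minus an offset, and the error probability is 10 raised to the power of minus the score divided by 10. The sketch below assumes an offset of 33, which is what the table rows imply (53 - 33 = 20); the function names are illustrative.

```python
def phred_score(char: str, offset: int = 33) -> int:
    """Phred quality score encoded by one FASTQ quality character."""
    return ord(char) - offset


def error_probability(score: int) -> float:
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Reproduce the rows of the table above.
for char in "5?I":
    score = phred_score(char)
    print(char, ord(char), score, f"{error_probability(score):g}")
```

Applying `phred_score` to every character of a quality line gives the per-base scores that quality-control tools summarize.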
Demultiplex stat file
Demultiplexing Output Files, Summary File
• The Demultiplex_Stats file is located in the Unaligned/Basecall_Stats_FCID
directory.
Troubleshooting indexes
• Linux command line to determine raw index sequence frequency
@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA
TGAAACCAGTGTTCTTAATTGGCATTTTACACACACACACACAGAATTTAAAAAAAAAATCAAAGGAAATCATTCTAAATGTACTATGATAGCATGTTAAA
+
=55>7;?::BDADDD@EE88DCD?DFFEFFECBE6666BB=B;<;<-34:;<CB51>=BBEE>EE?3D@??CB->:=:AA8DDDDDDBBE9;,=?:/89<E
• Go to Undetermined_Indices/Sample_lane<n>
• Command line:
gunzip -c lane1_Undetermined_L001_R1_001.fastq.gz \
| awk '{if($2~/:/) {sub(/.*:/,"",$2); print $2}}' \
| sort -n | uniq -c | sort -n -r > index.list.txt
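The same tally can be produced in Python, which is convenient if the counts feed into further checks. This is a sketch rather than a drop-in replacement for the pipeline above: the function names are illustrative, it assumes four-line FASTQ records, and it takes the index from the text after the last colon of the second header field, as the awk expression does.

```python
import gzip
from collections import Counter
from typing import Iterable


def count_indexes(lines: Iterable[str]) -> Counter:
    """Count index sequences found in the header lines of FASTQ records."""
    counts: Counter = Counter()
    for line_number, line in enumerate(lines):
        if line_number % 4 != 0:  # only the first line of each record is a header
            continue
        fields = line.split()
        if len(fields) >= 2 and ":" in fields[1]:
            counts[fields[1].rsplit(":", 1)[1]] += 1
    return counts


def count_indexes_in_file(fastq_gz_path: str) -> Counter:
    """Same as count_indexes, reading a gzip-compressed FASTQ file."""
    with gzip.open(fastq_gz_path, "rt") as handle:
        return count_indexes(handle)


# Tiny in-memory example; the second and third records are made up.
example = [
    "@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA", "ACGT", "+", "IIII",
    "@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13502:2241 1:N:0:CTTGTA", "ACGT", "+", "IIII",
    "@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13503:2242 1:N:0:GCCAAT", "ACGT", "+", "IIII",
]
for index, count in count_indexes(example).most_common():
    print(count, index)
```

A frequent index in the undetermined reads that is absent from the sample sheet usually points to a typo or a missing row in the sheet.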
FastQC
• FastQC provides a simple way to do some quality control
checks on raw sequence data.
• It gives a quick impression of whether your data has any
problems of which you should be aware before doing any
further analysis.
• Main Functions
-- Import of data from BAM, SAM or FastQ files (any variant)
-- A quick overview showing in which areas there may be
problems
-- Summary graphs and tables to quickly assess your data
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Quality check
PRINSEQ
• PRINSEQ can be used to filter, reformat, or trim your genomic and
metagenomic sequence data.
• Fastq files as input
• Sequence data can be filtered to remove sequence copies, short or
long sequences, sequences with N's, low-quality sequences, and
much more.
http://prinseq.sourceforge.net/
/path/to/prinseq-lite -fastq <input.fastq> -out_good out/"fastq_filt" -out_bad null \
-trim_right 10 -ns_max_p 5 -lc_method dust -lc_threshold 10 -no_qual_header
Command(s)
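Two of the options in that command are easy to picture in code: -trim_right 10 removes 10 bases from the 3' end of each read, and -ns_max_p 5 discards reads in which more than 5% of the bases are N. The sketch below only illustrates these two criteria; it is not PRINSEQ's implementation or processing order, and the function names and example read are made up.

```python
from typing import Tuple


def trim_right(seq: str, qual: str, n: int) -> Tuple[str, str]:
    """Remove n bases (and their quality characters) from the 3' end."""
    if n <= 0:
        return seq, qual
    return seq[:-n], qual[:-n]


def n_percent(seq: str) -> float:
    """Percentage of bases in the sequence that are N."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)


def passes_ns_max_p(seq: str, max_percent: float) -> bool:
    """Keep the read only if its N content does not exceed max_percent."""
    return n_percent(seq) <= max_percent


# Made-up read: 14 bases, the last two are N.
seq, qual = "ACGTACGTACGTNN", "IIIIIIIIIIII##"
print(n_percent(seq), passes_ns_max_p(seq, 5))
trimmed_seq, trimmed_qual = trim_right(seq, qual, 10)
print(trimmed_seq, trimmed_qual, passes_ns_max_p(trimmed_seq, 5))
```

The example also shows why the order of trimming and filtering matters: the untrimmed read fails the N filter, while the trimmed read passes it.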
Filtering
Before Trimming After Trimming
Adaptor Trimming
• Adaptor trimming is done before downstream analysis
• If the read length is shorter than the actual insert size, the
read never reaches the adaptor, so there is no need to do trimming.
• --adaptor--masking .fasta file (CASAVA)
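The relationship between read length, insert size, and adaptor content can be made concrete with a small simulation. This is a simplified sketch with exact matching only, not how CASAVA or any dedicated trimmer works; the insert, the read lengths, the `min_overlap` parameter, and the adaptor prefix used here are assumptions chosen for illustration.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a full adaptor, or a partial adaptor at the 3' end, by exact match."""
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Look for the longest adaptor prefix that the read ends with.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read


def simulate_read(insert: str, adapter: str, read_length: int) -> str:
    """Sequencing reads through the insert and then into the adaptor."""
    return (insert + adapter)[:read_length]


ADAPTER = "AGATCGGAAGAGC"  # illustrative adaptor prefix
INSERT = "ACGTACGTAC"      # made-up 10-base insert

short_read = simulate_read(INSERT, ADAPTER, 8)   # read shorter than the insert
long_read = simulate_read(INSERT, ADAPTER, 16)   # read runs into the adaptor

print(short_read, trim_adapter(short_read, ADAPTER))
print(long_read, trim_adapter(long_read, ADAPTER))
```

The 8-base read comes back unchanged because it never reaches the adaptor, while the 16-base read is cut back to the 10-base insert.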
QC and filtering software
• Other software that performs either or both
of these:
-- Cutadapt
-- Trim Galore
-- Trimmomatic
-- Sickle/Scythe
-- FASTX-Toolkit
Thank you!