De-multiplexing & Quality Control Challenges and Solutions

Sridhar Srinivasan

Bioinformatician

Premas Lifescience

Abstract

• Illumina has multiple sequencing platforms that produce large amounts of high-quality sequence data in a short time frame.

• To use the full potential of a sequencing run, we generally multiplex many samples into one run.

• Multiplexing has to be followed by demultiplexing and quality control of the data to get reliable and reproducible results.

Our Discussion Here Is About

• The tools and strategies to get good demultiplexed data.

• How to check various quality parameters of sequencing data, and methods to get rid of any low-quality or contaminated reads.

Data analysis

[Figure: data analysis workflow. Stages: Images, Intensities, Basecalls (Bcl files), Reads, Alignments, Polymorphisms, Visualize, Biological results. Software labels: Instrument Control Software/RTA; CASAVA 1.8 / MSR / 3rd party tools.]

CASAVA

• CASAVA is a Linux application designed to:

– Translate base calls (.bcl files) to compressed, demultiplexed FASTQ files

– Align reads

– Call variants (SNPs and indels)

– Assign genotypes to variants

– Count expression levels for exons, genes and splice junctions in the case of RNA-seq runs

Demultiplexing overview

• Demultiplexing can be done by:

– CASAVA 1.8.2

– MiSeq Reporter software (for the MiSeq)

• Demultiplexing requires a run folder (with bcl files) and a sample sheet

• Demultiplexing occurs during BCL-to-FASTQ processing

• Each index sequence read is compared to the index sequence specified in the sample sheet

• No quality values are considered in this step
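The comparison described above can be sketched in a few lines of shell. This is only an illustration of exact index matching, not CASAVA's own implementation: the index-to-sample table, the sample names and the second read header are hypothetical, and in a real run the table would come from the sample sheet.

```shell
# Hypothetical index-to-sample table (in a real run this comes from SampleSheet.csv).
lookup='CTTGTA sampleA
ACAGTG sampleB'

assign_sample() {
    # $1 is a FASTQ header line; the index is the text after the last ':'.
    index=${1##*:}
    sample=$(printf '%s\n' "$lookup" | awk -v idx="$index" '$1 == idx { print $2 }')
    # No exact match: report the read as undetermined.
    echo "${sample:-Undetermined}"
}

assign_sample '@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA'
assign_sample '@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13502:2241 1:N:0:GGGGGG'
```

Only the index string is inspected here; base qualities play no part, in line with the last bullet above.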

How does demultiplexing occur?

• Illumina sequencing instruments generate *.bcl files as primary sequencing output.

• CASAVA contains a BCL to FASTQ converter (configureBclToFastq.pl) that combines these per-cycle *.bcl files from a run and translates them into FASTQ files.

• In addition to generating FASTQ files, CASAVA uses a user-created or IEM sample sheet to divide the run output into projects and samples, and stores these in separate directories.

• If no sample sheet is provided, all samples will be put in the Undetermined_Indices directory by lane, and not demultiplexed.

SampleSheet.csv

Header         Description
FCID           Flow cell ID
Lane           Positive integer indicating the lane number (1-8)
SampleID       ID of the sample
SampleRef      The reference sequence to be used for the sample
Index          Index sequence
Description    Description of the sample
Control        Y indicates the lane is a control lane; N means sample
Recipe         Recipe used for sequencing
Operator       Name or ID of the operator
SampleProject  The project the sample belongs to
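As an illustration, a one-sample sheet could be written and sanity-checked as sketched below. It assumes the ten-column CASAVA 1.8 layout (FCID, Lane, SampleID, SampleRef, Index, Description, Control, Recipe, Operator, SampleProject). The flow cell ID and index are borrowed from the FASTQ header example in this deck; the sample, reference, recipe, operator and project values are made up.

```shell
# Work in a scratch directory so no real sample sheet is overwritten.
workdir=$(mktemp -d)
sheet="$workdir/SampleSheet.csv"

# Hypothetical one-sample sheet; only FCID and Index come from the deck's FASTQ example.
cat > "$sheet" <<'EOF'
FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
B809UWABXX,1,sampleA,hg19,CTTGTA,example_library,N,R1,operator1,ProjectX
EOF

# Sanity check: every line should have exactly 10 comma-separated fields.
bad_rows=$(awk -F',' 'NF != 10 { n++ } END { print n + 0 }' "$sheet")
echo "rows with a wrong field count: $bad_rows"
```

A field-count check like this catches stray or missing commas, which are easy to introduce when the sheet is exported from a spreadsheet.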

Input Files for configureBclToFastq.pl

• Run Folder (from RTA or OLB)

– The files actually required are shown in the slide graphic

• SampleSheet.csv

– User created (Microsoft Excel is easiest)

– Saved in *.csv format

– Default location is the BaseCalls directory

BCL Conversion and Demultiplexing Invocation

• Create MakeFiles

– Builds the run folder structure and generates the MakeFiles

• cd into the Analysis Directory

– MakeFiles are created in the analysis directory

• Execute MakeFiles

– Starts the BCL conversion and demultiplexing run

Command(s)

/path/to/CASAVA/bin/configureBclToFastq.pl --input-dir <BaseCalls_DIR> \
    --output-dir <Unaligned> --sample-sheet <Input DIR>/SampleSheet.csv

cd /path/to/RunFolder/Unaligned

nohup make -j <n> &

The nohup command keeps the process running even if the session is interrupted or you log out.

The -j option specifies the extent of parallelization (the number of jobs make runs at once).

Bcl conversion and Demultiplexing options

• Selected command line options

Demultiplexing output fastq file

• The FASTQ files are located in the Unaligned/Project_<ProjectName>/Sample_<SampleName> directories

• Illumina FASTQ files use the following naming scheme:

<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz

• In the case of non-multiplexed runs, <sample name> will be replaced with the lane numbers (lane1, lane2, ..., lane8) and <barcode sequence> will be replaced with "NoIndex".
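The naming scheme can be taken apart with a small helper, sketched here. The function name `parse_fastq_name` and the example file name are hypothetical. The pattern is anchored from the right, so a sample name that itself contains underscores should still parse.

```shell
parse_fastq_name() {
    # Prints "sample barcode lane read set" separated by spaces.
    # The greedy first group lets the sample name contain underscores.
    printf '%s\n' "$1" | sed -E -n \
        's/^(.+)_([^_]+)_L([0-9]{3})_R([0-9]+)_([0-9]{3})\.fastq\.gz$/\1 \2 \3 \4 \5/p'
}

parsed=$(parse_fastq_name "sampleA_CTTGTA_L001_R1_001.fastq.gz")
echo "$parsed"
```

A name that does not follow the scheme produces no output, which makes unexpected files easy to spot in a loop over a sample directory.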

Demultiplexing Output Files, FastQ File

@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA

TGAAACCAGTGTTCTTAATTGGCATTTTACACACACACACACAGAATTTAAAAAAAAAATCAAAGGAAATCATTCTAAATGTACTATGATAGCATGTTAAA

+

=55>7;?::BDADDD@EE88DCD?DFFEFFECBE6666BB=B;<;<-34:;<CB51>=BBEE>EE?3D@??CB->:=:AA8DDDDDDBBE9;,=?:/89<E

Character  ASCII value  Phred score  Error probability
5          53           20           0.01
?          63           30           0.001
I          73           40           0.0001
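The three rows above follow from two relations: the Phred score is the character's ASCII value minus 33, and the error probability is 10^(-Q/10). A minimal shell sketch, assuming this Phred+33 encoding (the helper names `phred_of` and `error_prob` are made up for illustration):

```shell
phred_of() {
    # printf with a leading quote yields the numeric code of the character.
    ascii=$(printf '%d' "'$1")
    echo $((ascii - 33))
}

error_prob() {
    # Error probability for Phred score Q is 10^(-Q/10).
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# Recompute the three table rows: character, Phred score, error probability.
for c in 5 '?' I; do
    q=$(phred_of "$c")
    echo "$c $q $(error_prob "$q")"
done
```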

Demultiplex stat file

Demultiplexing Output Files, Summary File

• The Demultiplex_Stats file is located in the Unaligned/Basecall_Stats_FCID directory.

Troubleshooting indexes

• Linux command line to determine raw index sequence frequency

@HWI-BRUNOP20X:994:B809UWABXX:1:1101:13501:2240 1:N:0:CTTGTA

TGAAACCAGTGTTCTTAATTGGCATTTTACACACACACACACAGAATTTAAAAAAAAAATCAAAGGAAATCATTCTAAATGTACTATGATAGCATGTTAAA

+

=55>7;?::BDADDD@EE88DCD?DFFEFFECBE6666BB=B;<;<-34:;<CB51>=BBEE>EE?3D@??CB->:=:AA8DDDDDDBBE9;,=?:/89<E

• Go to Undetermined_Indices/Sample_lane<n>

• Command line:

gunzip -c lane1_Undetermined_L001_R1_001.fastq.gz \
| awk '{if($2~/:/) {sub(/.*:/,"",$2); print $2}}' \
| sort -n | uniq -c | sort -n -r > index.list.txt
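To see what the index frequency list looks like, the same awk filter can be tried on a few synthetic, uncompressed records. This is only a sketch: the read names, sequences and the `count_indexes` helper are invented, a plain `sort` is used for grouping, and a final awk step normalizes the spacing that `uniq -c` produces.

```shell
count_indexes() {
    # Reads FASTQ text on stdin; prints "<count> <index>", most frequent first.
    awk '{ if ($2 ~ /:/) { sub(/.*:/, "", $2); print $2 } }' \
        | sort | uniq -c | sort -n -r | awk '{ print $1, $2 }'
}

# Three made-up reads: two carry index CTTGTA, one carries ACAGTG.
result=$(printf '%s\n' \
    '@r1 1:N:0:CTTGTA' 'ACGT' '+' 'IIII' \
    '@r2 1:N:0:CTTGTA' 'ACGT' '+' 'IIII' \
    '@r3 1:N:0:ACAGTG' 'ACGT' '+' 'IIII' | count_indexes)
echo "$result"
```

On real data, an unexpectedly frequent index at the top of this list often points to a wrong or missing index entry in the sample sheet.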

FastQC

• FastQC provides a simple way to do some quality control checks on raw sequence data.

• It gives a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

• Main Functions

-- Import of data from BAM, SAM or FastQ files (any variant)

-- Providing a quick overview to tell you in which areas there may be problems

-- Summary graphs and tables to quickly assess your data

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Quality check

PRINSEQ

• PRINSEQ can be used to filter, reformat, or trim your genomic and metagenomic sequence data.

• Takes FASTQ files as input

• Sequence data can be filtered to remove sequence copies, short or long sequences, sequences with N's, low-quality sequences, and much more.

http://prinseq.sourceforge.net/

Command(s)

/path/to/prinseq-lite -fastq <input.fastq> -out_good out/fastq_filt -out_bad null \
    -trim_right 10 -ns_max_p 5 -lc_method dust -lc_threshold 10 -no_qual_header

Filtering

[Figure panels: Before Trimming, After Trimming]

Adaptor Trimming

• Adaptor trimming is done before downstream analysis

• If the read length is shorter than the actual insert size, there is no need to do trimming.

• --adaptor--masking .fasta file (CASAVA)

QC and filtering software

• Other software performing either or both of these:

-- cutadapt

-- Trim Galore

-- Trimmomatic

-- Sickle/Scythe

-- Fastx Toolkit

Thank you!