Quality Control of Sequencing Data
Surya Saha
Sol Genomics Network (SGN)
Boyce Thompson Institute, Ithaca, NY
[email protected] // Twitter: @SahaSurya
BTI Plant Bioinformatics Course 2015
Slides: Aureliano Bombarely
3/31/2015 BTI Plant Bioinformatics Course 2015 1
Quality Control of NGS Data
1. Evaluation
2. Preprocessing
Evaluation
Goal:
Learn to use read evaluation programs, paying attention to
relevant parameters such as the quality score distribution,
the length distribution and unusual read duplication.
Data: (Illumina data for two tomato ripening stages)
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
Tools: tar -zxvf (command line, untar and unzip the files)
head (command line, take a quick look at the files)
mv (command line, rename the files)
grep (command line, find/count patterns in files)
FASTX toolkit (command line, process fasta/fastq)
FastQC (GUI, calculates several statistics for each file)
Evaluation
Exercise 1:
1. Untar and unzip the file:
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
2. The raw data will be found in two directories: breaker and
immature_fruit. Print the first 10 lines of the files
SRR404331_ch4.fq, SRR404333_ch4.fq,
SRR404334_ch4.fq and SRR404336_ch4.fq.
Question 1.1: Are these files in fastq format?
3. Change the extension of the .fq files to .fastq
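Question 1.1 can also be checked programmatically. The following is a minimal Python sketch, not part of the course material, that assumes the common four-line fastq layout (header starting with `@`, sequence, separator starting with `+`, quality string of the same length as the sequence). The function name `looks_like_fastq` and the toy records are made up for illustration.

```python
import io

def looks_like_fastq(handle, max_records=3):
    """Return True if the first records follow the four-line fastq layout.

    Assumes unwrapped records: header, sequence, '+' line, quality string.
    """
    checked = 0
    for _ in range(max_records):
        lines = [handle.readline().rstrip("\n") for _ in range(4)]
        if not lines[0]:
            break  # end of file reached before max_records
        header, seq, plus, qual = lines
        if not header.startswith("@") or not plus.startswith("+"):
            return False
        if not seq or len(seq) != len(qual):
            return False
        checked += 1
    return checked > 0

# Toy records, invented for illustration (not taken from the course data).
toy = io.StringIO("@read1\nACGTACGT\n+\nIIIIHHHH\n@read2\nGGCC\n+\n@@II\n")
print(looks_like_fastq(toy))
```

To try it on a real file, pass an open file object, for example `open("SRR404331_ch4.fq")`, instead of the toy `StringIO`.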
Evaluation
Exercise 1:
4. Count the number of sequences in each fastq file using
the commands you learned last time.
5. Convert the fastq files to fasta.
6. Explore other tools in the FASTX toolkit.
7. Now count the number of sequences in each fasta file and see
if the number of sequences has changed.
Tip: Use ‘grep’
Tip: Use ‘fastq_to_fasta -h’ to see help
Use Google if you are stuck
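The counting and conversion steps can be sketched in Python as a cross-check on the command-line results. This is an illustrative sketch under the assumption of four-line fastq records. The helper names `read_fastq` and `fastq_to_fasta_text` and the toy data are hypothetical, and the code does not reproduce the FASTX implementation. It also shows why counting lines that start with `@` can overcount: a quality string may begin with `@` too.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) from four-line fastq records."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual

def fastq_to_fasta_text(handle):
    """Return fasta text with one '>name' line and one sequence line per read."""
    return "".join(f">{name}\n{seq}\n" for name, seq, _ in read_fastq(handle))

# Toy data, invented for illustration; read2 has a quality line starting with '@'.
toy_text = "@read1\nACGTACGT\n+\nIIIIHHHH\n@read2\nGGCC\n+\n@@II\n"

n_fastq = sum(1 for _ in read_fastq(io.StringIO(toy_text)))
fasta = fastq_to_fasta_text(io.StringIO(toy_text))
n_fasta = sum(1 for line in fasta.splitlines() if line.startswith(">"))
at_lines = sum(1 for line in toy_text.splitlines() if line.startswith("@"))

print(n_fastq, n_fasta, at_lines)  # records, fasta headers, lines starting with '@'
```

This sketch keeps every record, so the two counts agree. A real converter can apply its own filters (for example, discarding reads with unknown bases), which is one reason the counts in step 7 could differ; ‘fastq_to_fasta -h’ lists what the installed version does.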
Evaluation: Sequence Quality
Good Illumina dataset
Evaluation: Sequence Quality
Good Illumina dataset
Poor Illumina dataset
Evaluation: Sequence Quality
454
Pacific Biosciences
Evaluation: Sequence Content
Good Illumina dataset
Evaluation: Sequence Content
Good Illumina dataset
Poor Illumina dataset
Evaluation: Duplication
Good Illumina dataset
Evaluation: Duplication
Good Illumina dataset
Poor Illumina dataset
Evaluation: Overrepresented Sequences
Good Illumina dataset
Evaluation: Overrepresented Sequences
Good Illumina dataset
Poor Illumina dataset
Evaluation: Kmer content
Good Illumina dataset
Evaluation: Kmer content
Good Illumina dataset
Poor Illumina dataset
Evaluation: Kmer content
454
Pacific Biosciences
Evaluation
Exercise 2:
1. Type ‘fastqc’ to start the FastQC program. Load the four
fastq sequence files in the program.
Question 2.2: How many sequences does FastQC report for each file?
Question 2.3: What is the length range of these reads?
Question 2.4: What is the quality score range of these reads? Which
file looks best quality-wise?
Question 2.5: Do these datasets contain overrepresented reads?
Question 2.6: Looking at the k-mer content, do you think the samples
contain an adapter?
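The read count, length range and quality score range that FastQC reports can be cross-checked with a short script. The sketch below is an illustration only: it assumes four-line fastq records and Phred+33 quality encoding (an assumption about the data, not something stated in the slides), and the function name `summarize_fastq` and the toy reads are hypothetical.

```python
import io

def summarize_fastq(handle, offset=33):
    """Return read count, length range and quality range for a fastq stream.

    offset=33 assumes Phred+33 encoding; older Illumina files may use 64.
    """
    n = 0
    lengths = [None, None]  # [min, max] read length
    quals = [None, None]    # [min, max] Phred score
    while True:
        header = handle.readline()
        if not header:
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        n += 1
        scores = [ord(c) - offset for c in qual]
        candidates = [(lengths, [len(seq)])]
        if scores:
            candidates.append((quals, scores))
        for rng, values in candidates:
            lo, hi = min(values), max(values)
            rng[0] = lo if rng[0] is None else min(rng[0], lo)
            rng[1] = hi if rng[1] is None else max(rng[1], hi)
    return {"reads": n, "length": tuple(lengths), "quality": tuple(quals)}

# Toy reads, invented for illustration (not taken from the course data).
toy = io.StringIO("@read1\nACGTACGT\n+\nIIIIHHHH\n@read2\nGGCC\n+\n##5I\n")
summary = summarize_fastq(toy)
print(summary)
```

If the minimum quality comes out far below zero or the maximum far above the low 40s, the offset assumption is probably wrong for that file.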
Preprocessing
Goal:
Trim the low quality ends of the reads and remove
the short reads.
Data: (Illumina data for two tomato ripening stages)
ch4_demo_dataset.tar.gz
Tools: fastq-mcf (command line tool to process reads)
FastQC (GUI, calculates several statistics for each file)
Preprocessing
Exercise 3:
• Download the file adapters1.fa from
ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCorpoica/adapters1.fa
• Run the read processing program (fastq-mcf) on each of the
datasets using:
  • Min. qscore of 30
  • Min. length of 40 bp
• Type ‘fastqc’ to start the FastQC program. Load the four
new fastq sequence files. Compare the results with the
previous datasets.
Tip: Use ‘fastqc -h’ to see help
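To make the two thresholds concrete, here is a simplified Python sketch of end trimming followed by length filtering. It is not the fastq-mcf algorithm (which also clips adapters and has its own trimming logic). It only removes bases from the 3' end while their score is below the minimum, then drops reads shorter than the minimum length. Phred+33 encoding, the function names `trim_3prime` and `preprocess`, and the toy reads are assumptions made for illustration.

```python
def trim_3prime(seq, qual, min_q=30, offset=33):
    """Remove bases from the 3' end while their Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

def preprocess(records, min_q=30, min_len=40):
    """Trim each (name, seq, qual) record and keep it only if long enough."""
    for name, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            yield name, seq, qual

# Toy reads, invented for illustration: 'I' encodes Q40 and '#' encodes Q2.
records = [
    ("keep", "A" * 50, "I" * 45 + "#" * 5),   # 45 bp remain after trimming
    ("drop", "C" * 50, "I" * 30 + "#" * 20),  # 30 bp remain, below 40 bp
]
kept = list(preprocess(records))
print([(name, len(seq)) for name, seq, _ in kept])
```

Comparing read counts and length ranges before and after this kind of filtering mirrors the before/after comparison that the exercise asks for in FastQC.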