Post on 25-Jul-2020

Introduction to Bioinformatics: RNA-Seq Analysis

Hamza Farooq, MSc

LMP Seminar Series

15 October 2018

Outline of seminar

- Next Generation Sequencing (NGS) overview

- Types of data generated from raw sequenced read files (FASTQ, SAM, BAM)

- How to download publicly available RNA-Seq data from Gene Expression Omnibus

- Preliminary differential expression analysis of publicly available RNA-Seq data using Galaxy

Module 1

Genome Variation

Two unrelated humans have genomes that are ~99.8% similar by sequence (~3-4 million differences). Most differences are small, e.g. Single Nucleotide Polymorphisms (SNPs).

Human and chimpanzee genomes are about 96% similar.

Pictures: http://www.dana.org/news/publications/detail.aspx?id=24536, http://en.wikipedia.org/wiki/Chimpanzee

bioinformatics.ca

Sanger Sequencing

Slide credit: Aaron Quinlan

Use dideoxynucleotides to inhibit elongation of a DNA strand

Separate strands with gel electrophoresis

Technology revolution

Sequencing genomes: from months and years to hours or minutes

Project cost: from billions of dollars to thousands of dollars

DNA sequencing: DNA polymerase

[Figure: a single-stranded DNA template, free nucleotides and DNA polymerase combine for strand synthesis]

DNA polymerase moves along the template in one direction, integrating complementary nucleotides as it goes

Sequencing by synthesis (Solexa/Illumina)

[Figure: a template is amplified into a cluster by polymerase chain reaction (PCR)]

Repeatedly inject a mixture of color-labeled nucleotides (A, C, G and T) and DNA polymerase. When a complementary nucleotide is added to a cluster, the corresponding color of light is emitted. Capture images of this as it happens.

[Figure: DNA polymerase and labeled nucleotides are added to the clusters and an image is captured (snap). Shown here is just the first sequencing cycle.]

[Figure: images of the same clusters at Cycle 1 through Cycle 5]

Line up images and, for each cluster, turn the series of light signals into corresponding series of nucleotides
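Turning one cluster's series of light signals into nucleotides can be sketched in a few lines of Python. This is a toy illustration only: the intensity values, the channel order A, C, G, T and the function name `call_bases` are assumptions made for the example, and real base callers do considerably more than pick the brightest channel.

```python
# Toy base calling: one cluster, one intensity reading per channel per cycle.
CHANNELS = "ACGT"  # assumed channel order for this example


def call_bases(cycles):
    """Return the base with the brightest channel in each cycle."""
    read = []
    for intensities in cycles:
        brightest = max(range(len(CHANNELS)), key=lambda i: intensities[i])
        read.append(CHANNELS[brightest])
    return "".join(read)


# Hypothetical intensities (A, C, G, T) for one cluster over five cycles.
cluster_signal = [
    (0.1, 0.0, 0.9, 0.1),  # cycle 1: G is brightest
    (0.8, 0.1, 0.0, 0.1),  # cycle 2: A
    (0.0, 0.1, 0.1, 0.7),  # cycle 3: T
    (0.1, 0.9, 0.0, 0.0),  # cycle 4: C
    (0.7, 0.2, 0.1, 0.0),  # cycle 5: A
]
print(call_bases(cluster_signal))  # GATCA
```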

FASTQ format

Sample1_R1.fastq.gz

Sample2_R1.fastq.gz

End 1 End 2

Each sample will generate between 5 Gb (100x WES) and 300 Gb (100x WGS) of data

@ERR127302.1 HWI-EAS350_0441:1:1:1055:4898#0/1

GGCTCATCTTGAACTGGGTGGCGACCGTCCCTGGCCCCTTCTTGACACCCAGCGCNNNNNNNNNNNNNNNNA

+

4=B@D99BDDDDDDD:DD?B<<=?>6B#############################################

@ERR127302.2 HWI-EAS350_0441:1:1:1056:1163#0/1

GAATGAGAGGCCCTCCCCGTGGAGGCATGGTATCCGGCCGAGGGGGCTTAGTCATNNNNNNNNNNNNNNNNC

+

B?,B2,?=?1?1B?D@?:@?DB3>AD,8DD??-B?#####################################

@ERR127302.3 HWI-EAS350_0441:1:1:1057:13164#0/1

GGCCGCAGTGCCATTGAGCTCACCAAAATGCTCTGTGAAATCCTGCAGGTTGGGGANNNNNNNNNNNNNNGA

+

DFBH?GDEG>GEGGDHH>HBDBEGD8G<GG<DGGGCB><82???@DDBBDDGGE##################

Sample1_R2.fastq.gz

Sample2_R2.fastq.gz

Each record has four lines: Header, Sequence, Placeholder (+), Quality
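The four-line record layout can be read with a short generator. This is a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequences); the function name `read_fastq` and the two short records are made up for the example.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) tuples from an open FASTQ text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        placeholder = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not placeholder.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# Two made-up records, in the same layout as the slide.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIHH#\n@read2\nACGN\n+\nDD?#\n")
for name, sequence, quality in read_fastq(example):
    print(name, sequence, quality)
```

For a compressed file such as Sample1_R1.fastq.gz, the same function can be given a handle opened with `gzip.open(path, "rt")` from the standard library.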

Header fields: instrument : flowcell lane : tile number : x : y # index for multiplexed sample / member of pair
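The header fields can be pulled apart with a regular expression. A minimal sketch, assuming the older header layout shown in the slide's records (newer Illumina pipelines write a different layout); the pattern and the function name `parse_read_id` are written for this example only.

```python
import re

# instrument:lane:tile:x:y#index/pair-member, as in the slide's example reads.
HEADER_PATTERN = re.compile(
    r"^(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair_member>[12])$"
)


def parse_read_id(read_id):
    """Split an old-style Illumina read identifier into named fields."""
    match = HEADER_PATTERN.match(read_id)
    if match is None:
        raise ValueError(f"unrecognised read identifier: {read_id!r}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "pair_member"):
        fields[key] = int(fields[key])
    return fields


# First header line from the slide: archive accession, then the read identifier.
header = "@ERR127302.1 HWI-EAS350_0441:1:1:1055:4898#0/1"
accession, read_id = header[1:].split(" ", 1)
print(accession)
print(parse_read_id(read_id))
```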

Quality score characters, from lowest to highest (ASCII 33 to 126):

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
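Each quality character stands for a number. A minimal sketch of the decoding, assuming the Phred+33 encoding (character code minus 33); the slides do not state the offset, and some older Illumina pipelines used 64 instead, so the `offset` parameter is left adjustable. The function names are made up for the example.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality]


def error_probability(score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Start of the first quality line on the slide.
scores = phred_scores("4=B@D99B")
print(scores)  # [19, 28, 33, 31, 35, 24, 24, 33]

# '#' decodes to 2 under this assumption, the value seen under the N bases.
print(phred_scores("#"))  # [2]
```

Under this encoding a score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 30 to 1 in 1000.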

After generation of FASTQ data

• What to do with the obtained reads?

• Most of the time the reads will be aligned to a reference genome

– Leverage high-quality assemblies of existing species for each newly sequenced individual

SAM/BAM

• Used to store alignments

• SAM = text, BAM = binary

SRR013667.1 99 19 8882171 60 76M = 8882214 119

NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGA

#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>

Fields, in order: Read name, Flag, Reference, Position, Mapping quality, CIGAR, Mate reference, Mate position, Template length, Bases, Base qualities

Sample1.bam

Sample2.bam

Each BAM file is between 10 Gb and 500 Gb

SAM: Sequence Alignment/Map format
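The example alignment line can be split into its fields with plain Python. This is a sketch, not a replacement for a proper SAM/BAM library: it handles only the 11 mandatory tab-separated columns and a subset of the flag bits, and the function names are made up for the example. The bases and qualities on the slide are shorter than the 76 bases implied by the CIGAR string, presumably truncated, and are used here as shown.

```python
import re

# Subset of the SAM flag bits, lowest bit first.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}
FIELD_NAMES = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]


def decode_flag(flag):
    """List the names of the flag bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string such as '76M' into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def parse_sam_line(line):
    """Parse the 11 mandatory columns of one SAM alignment line."""
    record = dict(zip(FIELD_NAMES, line.rstrip("\n").split("\t")[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    return record


line = "\t".join([
    "SRR013667.1", "99", "19", "8882171", "60", "76M", "=", "8882214", "119",
    "NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGA",
    "#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>",
])
record = parse_sam_line(line)
print(record["rname"], record["pos"], parse_cigar(record["cigar"]))
print(decode_flag(record["flag"]))
```

Flag 99 is 1 + 2 + 32 + 64: the read is paired, in a proper pair, its mate is on the reverse strand, and it is the first read of the pair. The "=" in the mate reference column means the mate maps to the same reference (chromosome 19 here).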

Summary

• NGS technology allows generation of hundreds of millions of reads for different high throughput purposes, including transcript quantification and genome sequencing.

• FASTQ format – DNA sequence + quality of each nucleotide for each read from the sequencer.

• SAM format – FASTQ + alignment info (chromosome, start, end for each read)

• BAM format – SAM converted to binary form to conserve space

How to get and process sequencing data

• Overview of 3 useful websites:

- Gene Expression Omnibus

  - Repository for descriptions of sequencing data generated from studies

- Sequence Read Archive

  - Repository for downloading publicly accessible sequencing data

- Galaxy

  - Online platform for free bioinformatics processing

Gene Expression Omnibus

Gene Expression Omnibus: Datasets

Sequence Read Archive

Galaxy

Workflows / Pipelines

Demo of “at home” RNA-Seq Analysis

First, we need our input data (GEO)

Description of the sample

SRA Interactive Download Page

All files related to that experiment are retrieved

Can filter the data before downloading, and can also download in FASTA format (i.e. FASTQ but with no quality information)
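The relationship between the two formats can be shown by converting one to the other: keep the header and the sequence, drop the placeholder and quality lines. A minimal sketch, assuming four-line FASTQ records; the function name and the sample record are made up for the example.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write each FASTQ record as FASTA (header + sequence); return the record count."""
    count = 0
    while True:
        header = fastq_handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # placeholder '+' line, not needed
        fastq_handle.readline()  # quality line, dropped in FASTA
        fasta_handle.write(">" + header[1:] + "\n" + sequence + "\n")
        count += 1
    return count


fastq = io.StringIO("@read1\nGATTACA\n+\nIIIIHH#\n")
fasta = io.StringIO()
print(fastq_to_fasta(fastq, fasta))  # 1
print(fasta.getvalue(), end="")
```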

Alternatively, SRA command line

- Available for all operating systems

- Can download reads more precisely
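As an illustration of downloading reads more precisely, the sketch below only builds (and prints) a `fastq-dump` command from the SRA Toolkit that would fetch a limited range of reads; it does not run it. The accession `SRR0000000` is a placeholder, the helper name is made up, and the option names should be checked against `fastq-dump --help` for the installed toolkit version.

```python
import shlex


def fastq_dump_command(accession, first_spot=None, last_spot=None, outdir="."):
    """Build a fastq-dump command line as a list of arguments."""
    command = ["fastq-dump", "--split-files", "--gzip", "--outdir", outdir]
    if first_spot is not None:
        command += ["--minSpotId", str(first_spot)]  # first read (spot) to fetch
    if last_spot is not None:
        command += ["--maxSpotId", str(last_spot)]  # last read (spot) to fetch
    command.append(accession)
    return command


# Placeholder accession: fetch only the first 100000 spots, split by read end.
command = fastq_dump_command("SRR0000000", last_spot=100000)
print(shlex.join(command))
```

With the toolkit installed and a real accession, the list can be passed to `subprocess.run(command, check=True)`.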

Disclaimer

• The dataset being analyzed is a FASTA file from an older sequencing technology

• Subsampled reads for carcinoma vs matched normal samples

• However, the steps to analyze publicly downloaded or your own data using Galaxy would be similar

• Key take away: familiarize yourself with the Galaxy work environment

General overview of RNA-Seq Pipeline

Each data stage is followed by the step, and example tools, that produce the next stage.

Raw FASTQ Data

- QC checking, adapter trimming, low-quality base trimming: FastQC, Cutadapt, Trim Galore

QC Passed Reads

- Alignment to reference genome/transcriptome: STAR, HISAT2, TopHat

Aligned BAM

- Quantification of transcripts: HTSeq, StringTie, Cufflinks, RSEM

Quantified Transcripts

- Normalization and differential expression between genes: DESeq2, Ballgown, Cuffdiff, EBSeq

Final DE Gene List
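Why normalization comes before differential expression can be seen with a toy calculation. The sketch below uses counts per million and a log2 fold change with a pseudocount; this is a simplification for illustration and not the statistical model used by DESeq2 or the other tools listed. Gene names and counts are hypothetical.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(case, control, pseudocount=1.0):
    """log2(case / control) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((case[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in case
    }


# Hypothetical raw counts: the carcinoma sample was sequenced twice as deeply.
normal = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 600}
carcinoma = {"GENE_A": 800, "GENE_B": 600, "GENE_C": 600}

changes = log2_fold_change(counts_per_million(carcinoma), counts_per_million(normal))
for gene, change in changes.items():
    print(gene, round(change, 2))
```

GENE_B has twice the raw count in the carcinoma sample, but after scaling for sequencing depth its fold change is zero, while GENE_A is up about fourfold (log2 of about 2) and GENE_C is down about twofold (log2 of about -1).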

Going to Galaxy

Uploading Data

*Names changed, and custom reference and GTF (annotation) files uploaded

Alignment to Reference

Transcript Quantification

We’ll want the “assembled transcripts” output

Merging Quantified Transcripts

This will produce a single combined counts matrix covering all samples
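The idea of merging per-sample quantifications into one matrix can be sketched as follows. This illustrates the data structure only and is not what the Galaxy tool does internally; sample names, gene names and counts are hypothetical.

```python
def merge_counts(per_sample):
    """Combine {sample: {gene: count}} into (genes, samples, matrix).

    Rows of the matrix are genes, columns are samples; a gene missing
    from a sample gets a count of 0.
    """
    samples = sorted(per_sample)
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    matrix = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, matrix


# Hypothetical per-sample counts.
per_sample = {
    "normal": {"GENE_A": 100, "GENE_B": 300},
    "carcinoma": {"GENE_A": 800, "GENE_C": 50},
}
genes, samples, matrix = merge_counts(per_sample)

# Print as a tab-separated table with a header row.
print("\t".join(["gene"] + samples))
for gene, row in zip(genes, matrix):
    print("\t".join([gene] + [str(n) for n in row]))
```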

Differential Expression

Click on execute

Viewing Results

Exporting Results

How does the whole workflow look?

Extracting Workflow

Publicly Available Datasets

- TCGA: The Cancer Genome Atlas

- International Cancer Genome Consortium

- R2: Genomics Analysis and Visualization Platform

• The lecture slides shown today were adapted from the instructors at the Canadian Bioinformatics Workshops (bioinformaticsdotca.github.io)

– Workshops cover high-throughput sequencing, RNA-Seq analysis, epigenomic data analysis, etc.

– Their course content is free to use!!

• Look at biostars.org for questions and help along the way

Acknowledgements

Thank you

Any questions?