Introduction to Bioinformatics: RNA-Seq Analysis

Hamza Farooq, MSc.

LMP Seminar Series

15 October 2018

Outline of seminar

- Next Generation Sequencing (NGS) overview

- Types of data generated from raw sequencing reads (FASTQ, SAM, BAM)

- How to download publicly available RNA-Seq data from Gene Expression Omnibus

- Preliminary differential expression analysis of publicly available RNA-Seq data using Galaxy

Module 1

Genome Variation

Two unrelated humans have genomes that are ~99.8% similar by sequence (~3-4 million differences). Most differences are small, e.g. Single Nucleotide Polymorphisms (SNPs).

Human and chimpanzee genomes are about 96% similar.

Pictures: http://www.dana.org/news/publications/detail.aspx?id=24536, http://en.wikipedia.org/wiki/Chimpanzee

bioinformatics.ca

Sanger Sequencing

Slide credit: Aaron Quinlan

Use dideoxynucleotides to inhibit elongation of a DNA strand

Separate the resulting strands by size with gel electrophoresis

Technology revolution

Sequencing genomes: from months and years to hours/minutes!

Project cost: from billions of dollars to thousands of dollars

DNA sequencing: DNA polymerase

Single-stranded DNA template + free nucleotides + DNA polymerase (DNA Pol) gives strand synthesis

DNA polymerase moves along the template in one direction, integrating complementary nucleotides as it goes (the new strand grows from its 5' end towards its 3' end).
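The pairing rule the polymerase follows can be sketched in a few lines of Python. This is only an illustration of complementary base pairing; the function name `synthesize_strand` and the example template are made up for this sketch.

```python
# Watson-Crick pairing used when copying a template strand.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesize_strand(template_3_to_5: str) -> str:
    """Return the new strand, written 5'->3', built one nucleotide at a time.

    The template is given in the 3'->5' direction, i.e. in the order in
    which the polymerase encounters its bases.
    """
    new_strand = []
    for base in template_3_to_5:
        # Each template base determines the nucleotide that is integrated.
        new_strand.append(COMPLEMENT[base])
    return "".join(new_strand)


print(synthesize_strand("TACGGA"))  # ATGCCT
```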

Template to cluster by polymerase chain reaction (PCR)

Sequencing by synthesis (Solexa/Illumina)

Repeatedly inject a mixture of color-labeled nucleotides (A, C, G and T) and DNA polymerase. When a complementary nucleotide is added to a cluster, the corresponding color of light is emitted. Capture images of this as it happens.

DNA polymerase

Pretend these are clusters

(snap)

Shown here is just the first sequencing cycle

Sequencing by synthesis (Solexa/Illumina)

Line up images and, for each cluster, turn the series of light signals into corresponding series of nucleotides

Cycle 1, Cycle 2, Cycle 3, Cycle 4, Cycle 5
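Turning a series of light signals into nucleotides can be pictured with a toy base caller. This is a minimal sketch, not how an Illumina instrument actually processes images: the colour-to-base table, the cluster names and the function `call_bases` are all assumptions made for illustration.

```python
# Hypothetical colour-to-base assignment; real instruments use their own dyes.
COLOR_TO_BASE = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}


def call_bases(signals_per_cycle):
    """Build one read per cluster from per-cycle colour observations.

    signals_per_cycle: list with one dict per cycle, mapping
    cluster id -> observed colour. Unrecognised signals become "N".
    """
    reads = {}
    for cycle_signals in signals_per_cycle:
        for cluster_id, color in cycle_signals.items():
            base = COLOR_TO_BASE.get(color, "N")
            reads[cluster_id] = reads.get(cluster_id, "") + base
    return reads


cycles = [
    {"cluster_1": "green", "cluster_2": "red"},    # cycle 1
    {"cluster_1": "blue", "cluster_2": "red"},     # cycle 2
    {"cluster_1": "yellow", "cluster_2": "none"},  # cycle 3, no clear signal for cluster_2
]
reads = call_bases(cycles)
print(reads)  # {'cluster_1': 'ACG', 'cluster_2': 'TTN'}
```

Each cycle adds one base to every cluster's read, so the read length equals the number of cycles.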

FASTQ format

End 1: Sample1_R1.fastq.gz, Sample2_R1.fastq.gz

End 2: Sample1_R2.fastq.gz, Sample2_R2.fastq.gz

Each sample will generate between 5 Gb (100x WES) and 300 Gb (100x WGS) of data.

@ERR127302.1 HWI-EAS350_0441:1:1:1055:4898#0/1
GGCTCATCTTGAACTGGGTGGCGACCGTCCCTGGCCCCTTCTTGACACCCAGCGCNNNNNNNNNNNNNNNNA
+
4=B@D99BDDDDDDD:DD?B<<=?>6B#############################################
@ERR127302.2 HWI-EAS350_0441:1:1:1056:1163#0/1
GAATGAGAGGCCCTCCCCGTGGAGGCATGGTATCCGGCCGAGGGGGCTTAGTCATNNNNNNNNNNNNNNNNC
+
B?,B2,?=?1?1B?D@?:@?DB3>AD,8DD??-B?#####################################
@ERR127302.3 HWI-EAS350_0441:1:1:1057:13164#0/1
GGCCGCAGTGCCATTGAGCTCACCAAAATGCTCTGTGAAATCCTGCAGGTTGGGGANNNNNNNNNNNNNNGA
+
DFBH?GDEG>GEGGDHH>HBDBEGD8G<GG<DGGGCB><82???@DDBBDDGGE##################

Each record has four lines: header, sequence, placeholder (+), quality.
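The four-line record structure can be read with a short Python generator. This is a minimal sketch that assumes each record occupies exactly four lines (no wrapped sequences); `FastqRecord` and `read_fastq` are names invented here, and the two reads in the example are made up.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines: header, sequence, placeholder, quality."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # placeholder line, starts with "+"
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed record: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


example = io.StringIO(
    "@read1\nGATTACA\n+\nIIIIII#\n"
    "@read2\nACGN\n+\n4=B#\n"
)
records = list(read_fastq(example))
print(len(records), records[0].sequence)  # 2 GATTACA
```

For a compressed file such as Sample1_R1.fastq.gz, the same function should work on a handle opened with `gzip.open(path, "rt")`.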

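Header fields: instrument : flowcell lane : tile number : x : y # index for multiplexed sample / member of pair

For example, in @ERR127302.1 HWI-EAS350_0441:1:1:1055:4898#0/1 the instrument is HWI-EAS350_0441, the flowcell lane is 1, the tile number is 1, x is 1055, y is 4898, the index is 0 and the read is member 1 of the pair.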
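The header layout can be pulled apart with a regular expression. This sketch assumes headers shaped exactly like the ones on the slide (a read identifier, a space, then the colon-separated fields); other sequencers and newer Illumina software write headers differently. The pattern and the function `parse_header` are illustrative names.

```python
import re

# Matches headers of the form shown on the slide:
# @<read id> <instrument>:<lane>:<tile>:<x>:<y>#<index>/<pair member>
HEADER_PATTERN = re.compile(
    r"^@?(?P<read_id>\S+)\s+"
    r"(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"#(?P<index>[^/]+)/(?P<pair_member>[12])$"
)


def parse_header(line: str) -> dict:
    """Split a FASTQ header line into its named fields."""
    match = HEADER_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {line!r}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "pair_member"):
        fields[key] = int(fields[key])
    return fields


fields = parse_header("@ERR127302.1 HWI-EAS350_0441:1:1:1055:4898#0/1")
print(fields["instrument"], fields["lane"], fields["x"], fields["y"])
```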

Quality characters in ASCII order, from lowest to highest quality score:

!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
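Each quality character stands for a number. The sketch below assumes the common Phred+33 encoding, where the character `!` means a score of 0; some older Illumina pipelines used an offset of 64 instead, so the `offset` parameter is left adjustable. The function names are made up for this example.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to numeric Phred scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-score / 10)


scores = phred_scores("4=B#")
print(scores)                 # [19, 28, 33, 2]
print(error_probability(20))  # 0.01, i.e. 1 error in 100 base calls
```

Under this assumption, the long runs of `#` in the example reads correspond to a score of 2, which matches the uncalled N bases they accompany.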

After generation of FASTQ data

• What to do with the obtained reads?

• Most of the time the reads will be aligned to a reference genome

– Leverage high-quality assemblies of existing species for each individual sequenced

SAM/BAM

• Used to store alignments

• SAM = text, BAM = binary

SRR013667.1 99 19 8882171 60 76M = 8882214 119
NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGA
#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>

Labelled fields: read name (SRR013667.1), flag (99), reference (19), position (8882171), CIGAR (76M), mate position (8882214), base qualities (last line)

Sample1.bam, Sample2.bam: between 10 Gb and 500 Gb per BAM file

SAM: Sequence Alignment/Map format
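A SAM alignment line is tab-separated, with eleven mandatory fields. The sketch below splits the example alignment into those fields and decodes the flag value 99 into its component bits. The field and flag names are shortened labels chosen for this example, only the eight lowest flag bits are listed, and the sequence and quality strings are the truncated ones from the slide (shorter than the 76M CIGAR implies).

```python
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
              "rnext", "pnext", "tlen", "seq", "qual"]

# Meaning of the lowest eight bits of the SAM flag.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def parse_sam_line(line: str) -> dict:
    """Split one alignment line into the eleven mandatory SAM fields."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, values[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    return record


def decode_flag(flag: int) -> list:
    """List the names of the flag bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


line = "\t".join([
    "SRR013667.1", "99", "19", "8882171", "60", "76M", "=", "8882214", "119",
    "NCCAGCAGCCATAACTGGAATGGGAAATAAACACTATGTTCAAAGCAGA",
    "#>A@BABAAAAADDEGCEFDHDEDBCFDBCDBCBDCEACB>AC@CDB@>",
])
record = parse_sam_line(line)
print(record["rname"], record["pos"], record["cigar"])
print(decode_flag(record["flag"]))
```

Flag 99 is 1 + 2 + 32 + 64: a paired read, mapped in a proper pair, whose mate is on the reverse strand, and which is the first read of the pair.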

Summary

• NGS technology allows generation of hundreds of millions of reads for different high-throughput purposes, including transcript quantification and genome sequencing.

• FASTQ format – DNA sequence + quality of each nucleotide for each read from the sequencer.

• SAM format – FASTQ + alignment info (chromosome, start, end for each read)

• BAM format – SAM converted to binary form to conserve space

How to get and process sequencing data

• Overview of 3 useful websites:

- Gene Expression Omnibus (GEO): repository for descriptions of sequencing data generated from studies

- Sequence Read Archive (SRA): repository for downloading publicly accessible sequencing data

- Galaxy: online platform for free bioinformatics processing

Gene Expression Omnibus

Gene Expression Omnibus: Datasets

Sequence Read Archive

Galaxy

Workflows / Pipelines

Demo of “at home” RNA-Seq Analysis

First, we need our input data (GEO)

Description of the sample

SRA Interactive Download Page

All files related to that experiment are retrieved

You can filter the data before downloading, and can also download in FASTA format (i.e. FASTQ without the quality information)
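The relationship between the two formats can be shown by converting one into the other: a FASTA record keeps only the header (marked with `>` instead of `@`) and the sequence. This is a minimal sketch assuming four-line FASTQ records; `fastq_to_fasta` is a name invented here.

```python
def fastq_to_fasta(fastq_text: str) -> str:
    """Drop the placeholder and quality lines, keeping header and sequence."""
    lines = fastq_text.strip().splitlines()
    fasta_lines = []
    for start in range(0, len(lines), 4):
        header, sequence = lines[start], lines[start + 1]
        fasta_lines.append(">" + header[1:])  # "@name" becomes ">name"
        fasta_lines.append(sequence)
    return "\n".join(fasta_lines) + "\n"


fasta = fastq_to_fasta("@read1\nGATTACA\n+\nIIIIII#\n@read2\nACGN\n+\n4=B#\n")
print(fasta)
```

The conversion only goes one way: once the quality line is dropped, it cannot be recovered from the FASTA file.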

Alternatively, SRA command line

- Available for all major operating systems

- Gives finer control over which reads are downloaded
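As one possible way to script the command-line route, the sketch below assembles a `fastq-dump` call from the SRA Toolkit. It only builds and prints the command; it assumes the toolkit is installed before anything is run. The helper `build_fastq_dump_command` is hypothetical, and the accession SRR013667 is used purely as a placeholder taken from the SAM example, not as the dataset used in this seminar.

```python
import shlex


def build_fastq_dump_command(accession, first_spot=None, last_spot=None, outdir="."):
    """Assemble a fastq-dump call that writes gzipped, per-end FASTQ files."""
    command = ["fastq-dump", "--split-files", "--gzip", "--outdir", outdir]
    if first_spot is not None:
        command += ["--minSpotId", str(first_spot)]  # first read (spot) to dump
    if last_spot is not None:
        command += ["--maxSpotId", str(last_spot)]   # last read (spot) to dump
    command.append(accession)
    return command


command = build_fastq_dump_command("SRR013667", first_spot=1, last_spot=100000,
                                   outdir="fastq")
print(shlex.join(command))
```

With the toolkit installed, passing `command` to `subprocess.run(command, check=True)` would perform the download; restricting the spot range is one way to fetch only a subset of reads.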

Disclaimer

• The dataset being analyzed is a FASTA file generated with older sequencing technology

• Subsampled reads for carcinoma vs matched normal samples

• However, the steps to analyze publicly downloaded or your own data using Galaxy would be similar

• Key takeaway: familiarize yourself with the Galaxy work environment

General overview of RNA-Seq Pipeline

1. Raw FASTQ Data to QC Passed Reads: QC checking, adapter trimming, low-quality base trimming. Tools: FastQC, Cutadapt, Trim Galore

2. QC Passed Reads to Aligned BAM: alignment to reference genome/transcriptome. Tools: STAR, HISAT2, TopHat

3. Aligned BAM to Quantified Transcripts: quantification of transcripts. Tools: HTSeq, StringTie, Cufflinks, RSEM

4. Quantified Transcripts to Final DE Gene List: normalization and differential expression between genes. Tools: DESeq2, Ballgown, Cuffdiff, EBSeq
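To see why normalization comes before differential expression, consider a toy calculation. This is a deliberately simple sketch using counts per million (CPM) and a log2 fold change; it is not the statistical model used by DESeq2 or the other tools listed, and the gene names and counts are invented.

```python
import math


def counts_per_million(counts: dict) -> dict:
    """Scale raw read counts by the sample's total count (library size)."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(cpm_a: dict, cpm_b: dict, pseudocount: float = 1.0) -> dict:
    """log2(B / A) per gene; the pseudocount avoids division by zero."""
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
    }


# Made-up counts: the carcinoma sample was sequenced twice as deeply.
normal = {"GENE_A": 500, "GENE_B": 300, "GENE_C": 200}
carcinoma = {"GENE_A": 1000, "GENE_B": 600, "GENE_C": 400}

fold_changes = log2_fold_change(counts_per_million(normal),
                                counts_per_million(carcinoma))
print(fold_changes)
```

Every raw count doubles between the two samples, yet after scaling by library size the fold changes are all zero: the difference came from sequencing depth, not from expression.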

Going to Galaxy

Uploading Data

*Names changed, and custom reference and GTF (annotation) files uploaded

Alignment to Reference

Transcript Quantification

We’ll want the “assembled transcripts” output

Merging Quantified Transcripts

This will produce a single combined matrix containing our counts
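Conceptually, the merge step lines up per-sample counts into one table with genes as rows and samples as columns. The sketch below illustrates that idea only; it is not the Galaxy tool's implementation, and the function `merge_counts`, the sample names and the counts are invented.

```python
def merge_counts(per_sample: dict):
    """Combine {sample: {gene: count}} into a genes-by-samples matrix.

    Genes missing from a sample are given a count of 0.
    """
    genes = sorted({gene for counts in per_sample.values() for gene in counts})
    samples = list(per_sample)
    matrix = [[per_sample[sample].get(gene, 0) for sample in samples]
              for gene in genes]
    return genes, samples, matrix


genes, samples, matrix = merge_counts({
    "normal": {"GENE_A": 10, "GENE_B": 0},
    "carcinoma": {"GENE_A": 25, "GENE_C": 7},
})
for gene, row in zip(genes, matrix):
    print(gene, row)
```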

Differential Expression

Click on execute

Viewing Results

Exporting Results

How does the whole workflow look?

Extracting Workflow

Publicly Available Datasets

TCGA: The Cancer Genome Atlas

International Cancer Genome Consortium

R2: Genomics Analysis and Visualization Platform

• The lecture slides shown today were adapted from material by the instructors of the Canadian Bioinformatics Workshops (bioinformaticsdotca.github.io)

– Workshops cover high-throughput sequencing, RNA-Seq analysis, epigenomic data analysis, etc.

– Their course content is free to use!!

• Look at biostars.org for questions and help along the way

Acknowledgements

Thank you

Any questions?
