RNAseq Introduction - biology.umd.edu

Bioinformatics Core RNAseq Introduction Ian Misner, Ph.D. Bioinformatics Crash Course

Page 1: RNAseq Introduction - biology.umd.edu

Bioinformatics  Core

RNAseq Introduction

Ian Misner, Ph.D. Bioinformatics Crash Course

Page 2: RNAseq Introduction - biology.umd.edu

Page 3: RNAseq Introduction - biology.umd.edu

Many types of RNA

•  rRNA, tRNA, mRNA, miRNA, ncRNA, etc.
•  Only ~2% of total RNA is mRNA

Page 4: RNAseq Introduction - biology.umd.edu

Why sequence RNA

•  Functional studies
   –  Drug-treated vs. untreated cell line
   –  Wild type vs. knockout
•  SNP finding
•  Transcriptome assembly
•  Novel gene finding
•  Splice variant analysis

Page 5: RNAseq Introduction - biology.umd.edu

Challenges

•  Sampling
   –  Purity? Quantity? Quality?
•  Exons can be problematic
   –  Mapping reads can become difficult (for example, reads that span exon-exon junctions)
•  RNA abundances vary by orders of magnitude
   –  Highly expressed genes can overpower genes of interest
   –  Organellar RNA can swamp the overall signal
•  RNA is fragile and must be handled properly
•  The RNA population turns over quickly within a cell

Page 6: RNAseq Introduction - biology.umd.edu

General workflows

•  Obtain raw data
•  Align/assemble reads
•  Process the alignment with a tool specific to the goal (e.g. ‘cufflinks’, ‘sailfish’)
•  Post-process
•  Import into downstream software (R, Matlab, Cytoscape, etc.)
•  Summarize and visualize
•  Create gene lists, prioritize candidates for validation, etc.

Page 7: RNAseq Introduction - biology.umd.edu

Experimental Design Questions

•  What is my biological question?
•  How much sequencing do I need?
•  What type of sequencing should I do?
   –  Read length?
   –  Which platform?
   –  Single-end (SE) or paired-end (PE)?
•  How much multiplexing can I do?
•  Should I pool samples?
•  How many replicates do I need?
•  What about duplicates?

Page 8: RNAseq Introduction - biology.umd.edu

What are you working with?

•  Novel – little or no data
•  Some data – ESTs or Unigenes
•  Basic Draft Genome
   –  Few thousand contigs
   –  Some annotation, mostly ab initio
•  Good Draft Genome
   –  Few thousand scaffolds to chromosome arms
   –  Better annotations with human verification
•  Model Organism
   –  Fully sequenced genome
   –  High confidence annotations
   –  Genetic maps and markers
   –  Mutant data available

Page 9: RNAseq Introduction - biology.umd.edu

Number of Reads/Replicates

Figure, panel (a): An increase in biological replication significantly increases the number of differentially expressed (DE) genes identified.

Liu Y et al. Bioinformatics 2014;30:301-304

Page 10: RNAseq Introduction - biology.umd.edu

Read Type and Platform

Read Type | Platform                           | Uses
50 SE     | Illumina                           | Gene expression quantification; SNP finding (good reference)
50 PE     | Illumina                           | Above, plus splice variants
100+ PE   | Illumina                           | Above, plus transcriptome assembly; DE within gene families
200+      | Ion Torrent, Sanger, 454, Nanopore | Splice variants; transcriptome assembly; haplotypes; too large for DE

Page 11: RNAseq Introduction - biology.umd.edu

Read Platform

Purdue University Discovery Park

Page 12: RNAseq Introduction - biology.umd.edu

Multiplexing

•  6-8 nt barcodes added to samples during library prep.

•  Allows for pooling of samples into the same lane.
   –  Mitigates lane effects
   –  Maximizes sequencing efficiency

•  Dual barcoding allows for up to 96 samples per lane.
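
As a minimal sketch of how barcoded reads pooled into one lane can be separated again, the Python below assigns each read to a sample by comparing its leading bases with a set of known barcodes. The barcode sequences, the sample names, the placement of the barcode at the start of the read, and the one-mismatch tolerance are all illustrative assumptions; on Illumina instruments the barcode is typically read as a separate index read and demultiplexing is handled by the instrument software.

```python
from collections import defaultdict

# Hypothetical 6-nt barcodes; real kits supply their own sequences.
BARCODES = {"ATCACG": "sample_A", "CGATGT": "sample_B", "TTAGGC": "sample_C"}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def assign_sample(read, barcodes=BARCODES, max_mismatches=1):
    """Return (sample, trimmed_read), or (None, read) if no single barcode matches."""
    length = len(next(iter(barcodes)))
    tag = read[:length]
    if len(tag) < length:
        return None, read
    hits = [s for bc, s in barcodes.items() if hamming(tag, bc) <= max_mismatches]
    if len(hits) == 1:
        return hits[0], read[length:]
    return None, read


def demultiplex(reads):
    """Group reads by sample; unassignable reads are collected under the key None."""
    bins = defaultdict(list)
    for read in reads:
        sample, trimmed = assign_sample(read)
        bins[sample].append(trimmed)
    return dict(bins)


reads = ["ATCACGAAAA", "CGATGTCCCC", "ATCACTGGGG", "GGGGGGTTTT"]
by_sample = demultiplex(reads)
```

Here the third read carries a single sequencing error in its barcode and is still assigned, while the fourth matches no barcode and is set aside.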

Page 13: RNAseq Introduction - biology.umd.edu

Page 14: RNAseq Introduction - biology.umd.edu

Replicates

•  Biological
   –  Measurement of variation between samples
   –  More are better
   –  Can detect genetic variation between samples
   –  Pooling with barcodes: each sample is a replicate
   –  Pooling without barcodes: each pool is a replicate

Page 15: RNAseq Introduction - biology.umd.edu

Replicates

•  Technical
   –  Can determine variation within sample preparation
   –  Can be cost prohibitive
   –  More biological replicates are better
   –  Useful across lanes to mitigate lane effects

Page 16: RNAseq Introduction - biology.umd.edu

Page 17: RNAseq Introduction - biology.umd.edu

Page 18: RNAseq Introduction - biology.umd.edu

Should I remove duplicates?

•  Maybe…
   –  Duplicates may correspond to biased PCR amplification of particular fragments
   –  For highly expressed, short genes, duplicates are expected even if there is no amplification bias
   –  Removing them may reduce the dynamic range of expression estimates
•  Assess library complexity and decide…
•  If you do remove them, assess duplicates at the level of paired-end reads, not single-end reads
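
To illustrate why duplicates are better assessed at the level of read pairs, the sketch below computes a duplication rate twice for the same toy data: once keyed on read 1 alone and once keyed on both mates. The tuple layout and the coordinates are hypothetical stand-ins for real alignments, and the rate definition (fraction of observations that repeat an earlier key) is one simple choice among several.

```python
def duplication_rate(keys):
    """Fraction of observations that repeat a key already seen."""
    keys = list(keys)
    if not keys:
        return 0.0
    return 1.0 - len(set(keys)) / len(keys)


# Hypothetical aligned pairs: (chrom, strand, read1_start, read2_start).
pairs = [
    ("chr1", "+", 100, 250),
    ("chr1", "+", 100, 250),  # same fragment seen twice: a paired-end duplicate
    ("chr1", "+", 100, 310),  # same read-1 start, but a different fragment
    ("chr1", "+", 100, 275),  # same read-1 start, but a different fragment
    ("chr2", "-", 500, 380),
]

# Single-end view: only the position of read 1 is used as the key.
se_rate = duplication_rate((c, s, r1) for c, s, r1, _ in pairs)

# Paired-end view: both mates must coincide for a pair to count as a duplicate.
pe_rate = duplication_rate(pairs)
```

With these toy values the single-end view flags roughly 60% of reads as duplicates, while the paired-end view flags about 20%, since most of the fragments that share a read-1 start end at different positions.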

Page 19: RNAseq Introduction - biology.umd.edu

Page 20: RNAseq Introduction - biology.umd.edu

Processing RNA for Sequencing

•  Depends upon what you’re looking to achieve.
•  mRNA is the main target
•  PolyA Selection
   –  Oligo-dT beads
   –  Highly efficient at getting mRNA and depleting the rRNA
   –  Can’t be used with non-polyA RNA
•  miRNA kits as well…

Page 21: RNAseq Introduction - biology.umd.edu

Strand Specific Sequencing

•  Illumina prep that ligates adaptors to the 5’ and 3’ ends of the RNA prior to reverse transcription into cDNA

•  Having strand information makes mapping more straightforward.

•  Can identify antisense transcripts

Page 22: RNAseq Introduction - biology.umd.edu

Insert Sizes

Page 23: RNAseq Introduction - biology.umd.edu

Alignment Options

•  No Genome?! No Problem!
   –  Transcriptome assembly
   –  There will be redundancy
•  NCBI Unigene Set
   –  Not necessarily complete
   –  Good to identify highly expressed genes
•  Valid Transcripts from your organism
   –  Easy to use but may miss novel genes
•  Fully Sequenced and Annotated Genome
   –  No excuses: this had better be a Nature paper!

Page 24: RNAseq Introduction - biology.umd.edu

Mapping RNAseq Reads

•  How many mismatches will you allow?
   –  Depends on what you’re mapping and what you’re using for a reference.
•  Number of hits allowed?
   –  How many times can a read match in different locations?
•  Splice Junctions?
   –  Is your mapping tool “splice aware”?
•  Expected distance for PE reads?
   –  This is important to know so that read pairs can map properly.
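
The mapping questions above can be made concrete with a toy example. The sketch below scans a short reference for every position where a read matches within a mismatch limit, which shows how the mismatch setting and the number of hits interact, and then checks whether two mates sit at roughly the expected distance. It is a brute-force illustration only: real aligners use indexed search, and this sketch is not splice aware. The function names, sequences, and tolerance are assumptions made for the example.

```python
def find_hits(read, reference, max_mismatches=2):
    """Return (start, mismatches) for every 0-based start within the mismatch limit."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


def is_unique(hits):
    """A read maps uniquely if exactly one location passed the mismatch limit."""
    return len(hits) == 1


def is_proper_pair(start1, start2, read_len, expected_insert, tolerance):
    """True if the outer span covered by both mates is close to the expected insert size."""
    outer = abs(start2 - start1) + read_len
    return abs(outer - expected_insert) <= tolerance


reference = "ACGTACGTTTGACCA"  # toy reference containing a repeated ACGT
repeat_hits = find_hits("ACGT", reference, max_mismatches=0)
unique_hits = find_hits("TTGA", reference, max_mismatches=0)
```

The read `ACGT` matches the toy reference in two places, so it would be discarded under a unique-mapping policy, while `TTGA` matches in exactly one.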

Page 25: RNAseq Introduction - biology.umd.edu

Why PE reads are great

[Figure: read alignments labeled “2 Mismatches” and “Exact Match”]

Page 26: RNAseq Introduction - biology.umd.edu

Purdue University Discovery Park

Page 27: RNAseq Introduction - biology.umd.edu

RNAseq Pipeline

TopHat → Cufflinks → Cuffcompare → Cuffdiff → cummeRbund
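
As a rough sketch of how the first steps of this pipeline chain together, the Python below only builds the command lines for TopHat, Cufflinks, and Cuffdiff without running them. The file names, index name, thread count, and condition labels are hypothetical, and the options shown (`-p` for threads, `-o` for the output directory, `-L` for condition labels) should be checked against the manual of the installed version. The Cuffcompare and cummeRbund steps are left out for brevity.

```python
import shlex


def tophat_cmd(index, reads1, reads2, out_dir, threads=4):
    """Paired-end alignment; TopHat writes accepted_hits.bam into out_dir."""
    return ["tophat", "-p", str(threads), "-o", out_dir, index, reads1, reads2]


def cufflinks_cmd(bam, out_dir, threads=4):
    """Transcript assembly and abundance estimation for one sample."""
    return ["cufflinks", "-p", str(threads), "-o", out_dir, bam]


def cuffdiff_cmd(gtf, group_a, group_b, labels, out_dir, threads=4):
    """Differential expression; replicates within a condition are comma-separated."""
    return ["cuffdiff", "-p", str(threads), "-o", out_dir,
            "-L", ",".join(labels), gtf, ",".join(group_a), ",".join(group_b)]


def render(cmd):
    """Shell-quoted string form of a command list, for logging or review."""
    return " ".join(shlex.quote(part) for part in cmd)


# Hypothetical inputs, for illustration only.
align = tophat_cmd("genome_index", "ctrl1_R1.fq", "ctrl1_R2.fq", "ctrl1_tophat")
assemble = cufflinks_cmd("ctrl1_tophat/accepted_hits.bam", "ctrl1_cufflinks")
diff = cuffdiff_cmd("transcripts.gtf",
                    ["ctrl1.bam", "ctrl2.bam"], ["drug1.bam", "drug2.bam"],
                    ["control", "treated"], "cuffdiff_out")
```

Building the commands as lists keeps each step easy to inspect before anything is executed, for example by printing `render(align)`.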

Page 28: RNAseq Introduction - biology.umd.edu

There are other options

Page 29: RNAseq Introduction - biology.umd.edu

Not all software is created equal

Page 30: RNAseq Introduction - biology.umd.edu

Page 31: RNAseq Introduction - biology.umd.edu

RNAseq “Best Practices”

•  Platform
   –  Illumina HiSeq
•  Read Length
   –  Minimum of 50 bp; 100 bp is better
•  Paired-end or Single
   –  PE
•  Read Depth
   –  30-40 million reads per sample
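
The read-depth target ties directly back to the multiplexing question. As a small worked example, the sketch below divides an assumed lane yield by the target depth to estimate how many samples can share a lane, capped at the 96 samples that dual barcoding allows. The lane yield of 200 million reads is a placeholder assumption, not a figure from these slides; substitute the yield quoted for your instrument and chemistry.

```python
def samples_per_lane(lane_yield_reads, target_reads_per_sample, max_barcodes=96):
    """Whole samples that fit in one lane at the target depth, capped by barcode count."""
    return min(lane_yield_reads // target_reads_per_sample, max_barcodes)


LANE_YIELD = 200_000_000  # assumed reads per lane; instrument dependent

# Samples per lane at the low and high end of the 30-40 million target.
plan = {
    target: samples_per_lane(LANE_YIELD, target)
    for target in (30_000_000, 40_000_000)
}
```

Under that assumed yield, a lane holds 6 samples at 30 million reads each or 5 samples at 40 million.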

Page 32: RNAseq Introduction - biology.umd.edu

RNAseq “Best Practices”

•  Number of biological replicates
   –  3 or more as cost allows
•  Experimental Design
   –  Balanced Block
•  What type of alignment
   –  TopHat: highly confident and splice aware
•  Unique or Multiple mapping
   –  Unique
   –  70-90% mapping rate
•  Analysis Method
   –  Use more than one approach
   –  Know the limits of the experiment
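
One simple reading of a balanced block layout, treating each lane as a block, is sketched below: barcoded replicates are spread so that every lane holds the same number of samples from every condition, which keeps lane effects from being confounded with treatment. This is an illustrative interpretation, not necessarily the exact design the slides have in mind; an alternative is to split every barcoded library across all lanes. The condition names and counts are hypothetical.

```python
def balanced_block(conditions, n_replicates, n_lanes):
    """Assign replicates to lanes so each lane gets equal numbers per condition."""
    if n_replicates % n_lanes != 0:
        raise ValueError("replicates must divide evenly across lanes")
    lanes = {lane: [] for lane in range(1, n_lanes + 1)}
    for condition in conditions:
        for rep in range(1, n_replicates + 1):
            lane = (rep - 1) % n_lanes + 1  # cycle replicates through the lanes
            lanes[lane].append(f"{condition}_rep{rep}")
    return lanes


layout = balanced_block(["treated", "control"], n_replicates=4, n_lanes=2)
```

In this layout each of the two lanes receives two treated and two control libraries, so a difference between lanes cannot masquerade as a treatment effect.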