RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes,...

32
RNAseq Introduc1on Promises and pi7alls

Transcript of RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes,...

Page 1: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

RNA-­‐seq  Introduc1on  

Promises  and  pi7alls  

Page 2: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

DNA  is  the  same  in  all  cells  but  which  RNAs  that  is  present    is  different  in  all  

cells  

Page 3: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

There  is  a  wide  variety  of  different  func1onal  RNAs  

Page 4: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

Which  RNAs  (and  some1mes  then  translated  to  proteins)  varies  between  

samples      -­‐Tissues    -­‐ Cell  types  

-­‐ Cell  states  

-­‐Individuals    -­‐Cells  

Page 5: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

RNA  gives  informa1on  on  which  genes  that  are  expressed  

How  DNA  get  transcribed  to  RNA  (and  some1mes  then  translated  to  proteins)  varies  between  e.  g.    -­‐Tissues    -­‐ Cell  types  

-­‐ Cell  states  

-­‐Individuals  

Page 6: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

RNA  gives  informa1on  on  which  genes  that  are  expressed  

How  DNA  get  transcribed  to  RNA  (and  some1mes  then  translated  to  proteins)  varies  between  e.  g.    -­‐Tissues    -­‐ Cell  types  

-­‐ Cell  states  

-­‐Individuals  

Page 7: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

RNA  flavors    (pre  sequencing  era)  

•  House  keeping  RNAs  –  rRNAs,  tRNAs,  snoRNAs,  snRNAs,  SRP  RNAs,  cataly1c  RNAs  (RNAse  E)  

•  Protein  coding  RNAs  –  (1  coding  gene  ~  1  mRNA)  

•  Regulatory  RNAs  –  Few  rare  examples  

Page 8: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

ENCODE,  the  Encyclopedia  of  DNA  Elements,  is  a  project  funded  by  the  Na1onal  Human  Genome  Research  Ins1tute  to  iden1fy  all  regions  of  transcrip1on,  transcrip1on  factor  associa1on,  chroma1n  structure  and  histone  modifica1on  in  the  human  genome  sequence.  

Page 9: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

ENCyclopedia  Of  Dna  Elements  

Page 10: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

Different  kind  of  RNAs  have  different  expression  values  

Landscape  of  transcrip/on  in  human  cells,  S  Djebali  et  al.  Nature  2012    

Page 11: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

What  defines  RNA  depends  on  how  you  look  at  it  

 

Variants  

Adapted  from  Landscape  of  transcrip/on  in  human  cells,  S  Djebali  et  al.  Nature  2012    

Abundance   House  keeping  RNAs  

mRNAs  

Regulatory  RNAs  

Novel  intergenic  

None  

Coverage  

Page 12: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

One  gene  many  different  mRNAs  RNA-seq: alternative splicing

Page 13: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

Defining  func1onal  DNA  elements  in  

the  human  genome    •  Statement  

–  A  priori,  we  should  not  expect  the  transcriptome  to  consist  exclusively  of  func1onal  RNAs.    

•  Why  is  that  –  Zero  tolerance  for  errant  

transcripts  would  come  at  high  cost  in  the  proofreading  machinery  needed  to  perfectly  gate  RNA  polymerase  and  splicing  ac1vi1es,  or  to  instantly  eliminate  spurious  transcripts.  

–  In  general,  sequences  encoding  RNAs  transcribed  by  noisy  transcrip1onal  machinery  are  expected  to  be  less  constrained,  which  is  consistent  with  data  shown  here  for  very  low  abundance  RNA    

 

•  Consequence  –  Thus,  one  should  have  high  

confidence  that  the  subset  of  the  genome  with  large  signals  for  RNA  or  chroma1n  signatures  coupled  with  strong  conserva1on  is  func1onal  and  will  be  supported  by  appropriate  gene1c  tests.    

–  In  contrast,  the  larger  propor1on  of  genome  with  reproducible  but  low  biochemical  signal  strength  and  less  evolu1onary  conserva1on  is  challenging  to  parse  between  specific  func1ons  and  biological  noise.  

 

Page 14: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

This  is  of  course  not  without  an  debate  Variants   Abundance  

Most ‘‘Dark Matter’’ Transcripts Are Associated WithKnown GenesHarm van Bakel1, Corey Nislow1,2, Benjamin J. Blencowe1,2, Timothy R. Hughes1,2*

1 Banting and Best Department of Medical Research, University of Toronto, Toronto, Ontario, Canada, 2 Department of Molecular Genetics, University of Toronto, Toronto,

Ontario, Canada

Abstract

A series of reports over the last few years have indicated that a much larger portion of the mammalian genome istranscribed than can be accounted for by currently annotated genes, but the quantity and nature of these additionaltranscripts remains unclear. Here, we have used data from single- and paired-end RNA-Seq and tiling arrays to assess thequantity and composition of transcripts in PolyA+ RNA from human and mouse tissues. Relative to tiling arrays, RNA-Seqidentifies many fewer transcribed regions (‘‘seqfrags’’) outside known exons and ncRNAs. Most nonexonic seqfrags are inintrons, raising the possibility that they are fragments of pre-mRNAs. The chromosomal locations of the majority ofintergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation siteusage, promoter- and terminator-associated transcripts, or new alternative exons; indeed, reads that bridge splice sitesidentified 4,544 new exons, affecting 3,554 genes. Most of the remaining seqfrags correspond to either single reads thatdisplay characteristics of random sampling from a low-level background or several thousand small transcripts (medianlength = 111 bp) present at higher levels, which also tend to display sequence conservation and originate from regions withopen chromatin. We conclude that, while there are bona fide new intergenic transcripts, their number and abundance isgenerally low in comparison to known exons, and the genome is not as pervasively transcribed as previously reported.

Citation: van Bakel H, Nislow C, Blencowe BJ, Hughes TR (2010) Most ‘‘Dark Matter’’ Transcripts Are Associated With Known Genes. PLoS Biol 8(5): e1000371.doi:10.1371/journal.pbio.1000371

Academic Editor: Sean R. Eddy, HHMI Janelia Farm, United States of America

Received December 3, 2009; Accepted April 9, 2010; Published May 18, 2010

Copyright: ! 2010 van Bakel et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permitsunrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by Genome Canada (http://www.genomecanada.ca) through the Ontario Genomics Institute, the Ontario Research Fund, andMarch of Dimes (http://www.marchofdimes.com). HvB was supported by the Netherlands Organization for Scientific Research (NWO; http://www.nwo.nl) (grantno. 825.06.033) and the Canadian Institutes of Health Research (CIHR; http://www.cihr-irsc.gc.ca/) (grant no. 193588). The funders had no role in study design,data collection and analysis, decision to publish, or preparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

Abbreviations: APA, alternative cleavage and polyadenylation; BW, bandwidth parameter; CAGE, capped analysis of gene expression; CNV, copy numbervariation; lincRNAs, large intervening noncoding RNAs; ncRNAs, noncoding RNAs; ORF, open reading frames; pasRNA, promoter-associated RNA; TSS, transcriptionstart site; TTS, transcription termination site; TU, transcript unit; TUF, transcript of unknown function

* E-mail: [email protected]

Introduction

In recent years established views of transcription have beenchallenged by the observation that a much larger portion of thehuman and mouse genomes is transcribed than can be accountedfor by currently annotated coding and noncoding genes. The bulkof these findings have come from experiments using ‘‘tiling’’microarrays with probes that cover the non-repetitive genome atregular intervals [1–9], or from sequencing efforts of full-lengthcDNA libraries enriched for rare transcripts [10,11]. Additionally,capped analysis of gene expression (CAGE) in human and mouseshow that a significant number of sequenced 59 tags map tointergenic regions [12]. Estimates of the proportion of transcriptsthat map to locations separate from known exons range from 47%to 80% and are distributed approximately equally between intronsand intergenic regions. Dubbed transcriptional ‘‘dark matter’’[13], the ‘‘hidden’’ transcriptome [1], or transcripts of unknownfunction (TUFs) [4,14], the exact nature of much of this additionaltranscription is unclear, but it has been presumed to comprise acombination of novel protein coding transcripts, extensions ofexisting transcripts, noncoding RNAs (ncRNAs), antisense tran-scripts, and biological or experimental background. Determining

the relative contributions of each of these potential sources isimportant for understanding the nature and possible biologicalfunction of transcriptional dark matter.

Homology searches for transcripts mapping outside knownannotation boundaries [10], as well as cDNA sequencing efforts,indicate that it is still possible to find new exons of protein codinggenes [10,15,16]. The genomic positions of TUFs are also biasedtowards known transcripts [8], suggesting that at least a portionmay represent extensions of current gene annotations. Neverthe-less, the majority of dark matter transcripts is thought to benoncoding [2,4,5,10]. Previous efforts to characterize dark mattertranscripts have revealed the existence of thousands of ncRNAswith evidence for tissue-specific expression [17,18], as well as overa thousand large intervening noncoding RNAs (lincRNAs)originating from intergenic regions bearing chromatin marksassociated with transcription [19]. Other studies have reportednew classes of ncRNAs, such as those that cluster close to thetranscription start sites (TSSs) of protein coding genes [20–24].These promoter-associated RNAs (pasRNAs) typically initiate inthe nucleosome free regions that mark a TSS, with transcriptionoccurring in both directions. Finally, results from the ENCODEpilot project have suggested a highly interleaved structure of the

PLoS Biology | www.plosbiology.org 1 May 2010 | Volume 8 | Issue 5 | e1000371

Perspective

The Reality of Pervasive TranscriptionMichael B. Clark1, Paulo P. Amaral1., Felix J. Schlesinger2., Marcel E. Dinger1, Ryan J. Taft1, John L.

Rinn3, Chris P. Ponting4, Peter F. Stadler5, Kevin V. Morris6, Antonin Morillon7, Joel S. Rozowsky8,

Mark B. Gerstein8, Claes Wahlestedt9, Yoshihide Hayashizaki10, Piero Carninci10, Thomas R. Gingeras2*,

John S. Mattick1*

1 Institute for Molecular Bioscience, University of Queensland, Brisbane, Queensland, Australia, 2 Watson School of Biological Sciences, Cold Spring Harbor Laboratory,

Cold Spring Harbor, New York, United States of America, 3 Broad Institute, Cambridge, Massachusetts, United States of America, 4 MRC Functional Genomics Unit,

Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom, 5 Department of Computer Science, University of Leipzig, Leipzig,

Germany, 6 Department of Molecular and Experimental Medicine, Scripps Research Institute, La Jolla, California, United States of America, 7 Institut Curie, UMR3244-

Pavillon Trouillet Rossignol, Paris, France, 8 Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America, 9 University of

Miami, Miami, Florida, United States of America, 10 Omics Science Center, RIKEN Yokohama Institute, Tsurumi-ku, Yokohama, Kanagawa, Japan

Current estimates indicate that onlyabout 1.2% of the mammalian genomecodes for amino acids in proteins. How-ever, mounting evidence over the pastdecade has suggested that the vast major-ity of the genome is transcribed, wellbeyond the boundaries of known genes, aphenomenon known as pervasive tran-scription [1]. Challenging this view, anarticle published in PLoS Biology by vanBakel et al. concluded that ‘‘the genome isnot as pervasively transcribed as previous-ly reported’’ [2] and that the majority ofthe detected low-level transcription is dueto technical artefacts and/or backgroundbiological noise. These conclusions attract-ed considerable publicity [3–6]. Here, wepresent an evaluation of the analysis andconclusions of van Bakel et al. comparedto those of others and show that (1) theexistence of pervasive transcription issupported by multiple independent tech-niques; (2) re-analysis of the van Bakel etal. tiling arrays shows that their results areatypical compared to those of ENCODEand lack independent validation; and (3)the RNA sequencing dataset used by vanBakel et al. suffered from insufficientsequencing depth and poor transcriptassembly, compromising their ability todetect the less abundant transcripts outsideof protein-coding genes. We conclude thatthe totality of the evidence stronglysupports pervasive transcription of mam-malian genomes, although the biologicalsignificance of many novel coding andnoncoding transcripts remains to be ex-plored.

Previous Evidence for PervasiveTranscription

The conclusion that the mammaliangenome is pervasively transcribed (i.e.,‘‘that the majority of its bases are associ-ated with at least one primary transcript’’[1]) was based on multiple lines ofevidence. Both large-scale cDNA sequenc-ing and hybridization to genome-widetiling arrays were the major empiricalsources of data. Analysis of full-lengthcDNAs from many tissues and develop-mental stages in mouse showed that atleast 63% of the genome is transcribed andidentified thousands of novel protein-coding transcripts and over 30,000 longnoncoding intronic, intergenic, and anti-sense transcripts [7–9]. In parallel, wholechromosome tiling array interrogation ofthe RNA content of a variety of humantissues and cell lines revealed that, collec-tively, at least 93% of genomic bases aretranscribed in one cell type or another[1,10–13].

Since it is well established that highlyexpressed mRNAs dominate the non-

ribosomal portion of the polyA+ transcrip-tome [7,8,10,14–19], normalization ap-proaches were used to reduce the quantityof highly expressed transcripts in thesecDNA analyses [7,8], and are implicit intiling array approaches. This was neces-sary to allow the detection of rarer (oftencell type–restricted [1,13,16,19,20]) tran-scripts.

The evidence for pervasive transcriptionalso includes observations from a widevariety of other independent techniques(see reviews [21] and [22] for references).Indeed, a simple query of currentlyavailable human spliced EST data inGenBank shows that documented tran-scripts cover 57.09% of the genome.Because ESTs are largely generated frompolyadenylated RNAs and do not exhaus-tively sample the transcriptome, this cov-erage represents the lower bound ofgenomic transcription.

Based on an analysis of genome-widetiling arrays and short read RNA sequenc-ing data, van Bakel et al. report that ‘‘most‘dark matter’ transcripts (i.e., novel tran-scripts of unknown function) are associated

The Perspective section provides experts with aforum to comment on topical or controversial issuesof broad interest.

Citation: Clark MB, Amaral PP, Schlesinger FJ, Dinger ME, Taft RJ, et al. (2011) The Reality of PervasiveTranscription. PLoS Biol 9(7): e1000625. doi:10.1371/journal.pbio.1000625

Academic Editor: Michael B. Eisen, University of California Berkeley, United States of America

Published July 12, 2011

Copyright: ! 2011 Clark et al. This is an open-access article distributed under the terms of the CreativeCommons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,provided the original author and source are credited.

Funding: This work was supported by the Australian Research Council and National Health and MedicalResearch Council (JSM), NHGRI (TRG, PC), NIH (KVM, MG), Japanese MEXT Research Grant for the RIKEN OmicsScience Center (YH, PC), EU FP-7 (PFS), Damon Runyon, Smith, Merkin and Searle scholarships (JLR), and UKBBSRC, ERC and MRC (CPP). The funders had no role in study design, data collection and analysis, decision topublish, or preparation of the manuscript. We thank Tim Hughes and Harm van Bakel for unreservedlyproviding their data to us for analysis.

Competing Interests: The authors have declared that no competing interests exist.

Abbreviation: PR, precision recall

* E-mail: [email protected] (JSM); [email protected] (TRG)

. These authors contributed equally to this work.

PLoS Biology | www.plosbiology.org 1 July 2011 | Volume 9 | Issue 7 | e1000625

Perspective

Response to ‘‘The Reality of Pervasive Transcription’’Harm van Bakel1, Corey Nislow1,2, Benjamin J. Blencowe1,2, Timothy R. Hughes1,2*

1 Banting and Best Department of Medical Research and Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario,

Canada, 2 Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada

Clark et al. criticize several aspects ofour study [1], and specifically challengeour assertion that the degree of pervasivetranscription has previously been overstat-ed. We disagree with much of theirreasoning and their interpretation of ourwork. For example, many of our conclu-sions are based on overall sequence readdistributions, while Clark et al. focus ontranscript units and seqfrags (sets ofoverlapping reads). A key point is thatone can derive a robust estimate of therelative amounts of different transcripttypes without having a complete recon-struction of every single transcript.

In this brief response, we first revisitwhat is meant by pervasive transcription,and its potential significance. We thendiscuss the major points raised by Clarket al. in the order presented in theircritique. Finally, we demonstrate thatconclusions very similar to those of ouroriginal study are reached with a datasetwith far greater read depth, obtained bystrand-specific sequencing of rRNA-de-pleted total RNA from a single cell type.

The Meaning of ‘‘Pervasive’’,and the Importance ofTranscript Abundance

Clark et al. define pervasive transcrip-tion of a genome to mean ‘‘that themajority of its bases are associated withat least one primary transcript’’, which isthe same definition used in the ENCODE1% paper [2]. We believe that this specificclaim is not contested, nor is it particularlyinteresting. First, it has long been assumedthat roughly half of the human genomecomprises introns [3]. Second, the mech-anisms that control the positions ofinitiation and termination of Pol IItranscription, as well as RNA processing,are imperfect, such that low-level back-ground transcripts from both physiologi-cally relevant and non-canonical sites arise[4–6]. Blockage of surveillance mecha-nisms that normally degrade such ‘‘cryp-

tic’’ transcripts greatly increases theirabundance [7,8].

We acknowledge that the phrase quotedby Clark et al. in our Author Summaryshould have read ‘‘stably transcribed’’, orsome equivalent, rather than simply ‘‘tran-scribed’’. But this does not change the factthat we strongly disagree with the funda-mental argument put forward by Clarket al., which is that the genomic areacorresponding to transcripts is more im-portant than their relative abundance. Thisviewpoint makes little sense to us. Given thevarious sources of extraneous sequencereads, both biological and laboratory-derived (see below), it is expected that withsufficient sequencing depth the entiregenome would eventually be encompassedby reads. Our statement that ‘‘the genomeis not as not as pervasively transcribed aspreviously reported’’ stems from the factthat our observations relate to the relativequantity of material detected.

Of course, some rare transcripts (and/or rare transcription) are functional, andlow-level transcription may also provide apool of material for evolutionary tinkering.But given that known mechanisms—inparticular, imperfections in termination(see below)—can explain the presence oflow-level random (and many non-random)transcripts, we believe the burden of proofis to show that such transcripts are indeedfunctional, rather than to disprove theirputative functionality.

Contradiction of PreviousReports

The fact that our analyses contradictprevious reports is precisely why we

emphasized the lack of abundant pervasivetranscription in our study. Clark et al. citepapers that have previously documentedpervasive transcription, and point out thatseveral different approaches have beenused as confirmation. We believe that Clarket al. misinterpret what can be claimedfrom much of the literature in this area, andfail to acknowledge known weaknesses insome of these studies. We previouslyreviewed these issues [9]. For example,the number of transfrags detected inpermuted tiling array data can be as highas it is in the real data [10]. In addition, acommon form of ‘‘validation’’ in thesepapers is RT-PCR or RACE, but theseapproaches are generally semi-quantitativeat best and are prone to artefacts such astemplate switching, which readily produceschimeric transcripts in vitro ([11] andreferences therein). Indeed, we note thatin the ENCODE 1% study [2] repeatedlycited by Clark et al., 75 of the 100 negativecontrols (randomly selected non-transfragregions) were actually detected by RACE,making the ‘‘validation’’ rate for negativecontrols only slightly lower than that for theintronic and intergenic transfrags (86%–88%). Thus, either the tiling arrays orRACE assays are highly error prone. Thecontention of Clark et al. that ‘‘any estimateof the pervasiveness of transcription re-quires inclusion of all data sources’’ isflawed, because if one introduces erroneousdata from even a single source, the estimatebecomes worse.

Accuracy of Tiling Arrays

We agree that results obtained fromtiling arrays should improve with increased

The Perspective section provides experts with aforum to comment on topical or controversial issuesof broad interest.

Citation: van Bakel H, Nislow C, Blencowe BJ, Hughes TR (2011) Response to ‘‘The Reality of PervasiveTranscription’’. PLoS Biol 9(7): e1001102. doi:10.1371/journal.pbio.1001102

Academic Editor: Michael B. Eisen, University of California Berkeley, United States of America

Published July 12, 2011

Copyright: ! 2011 van Bakel et al. This is an open-access article distributed under the terms of the CreativeCommons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,provided the original author and source are credited.

Funding: This work was supported by Canadian Institutes of Health Research (CIHR; http://www.cihr-irsc.gc.ca/) Operating Grant to TRH (grant no. 49451). HvB was supported by a CIHR postdoctoral fellowship (grantno. 193588). The funders had no role in study design, data collection and analysis, decision to publish, orpreparation of the manuscript.

Competing Interests: The authors have declared that no competing interests exist.

* E-mail: [email protected]

PLoS Biology | www.plosbiology.org 1 July 2011 | Volume 9 | Issue 7 | e1001102

Page 15: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

Defining functional DNA elements in the human genome Kellis M et al. PNAS 2014;111:6131-6138

Biochemical  evidence  not  enough  to  iden1fy  func1onal  RNAs  

Page 16: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

•  RNA  seq  course  

   

Page 17: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

The  RNA  seq  course  •  From  RNA  seq  to  reads  (Introduc1on)  •  Mapping  reads  programs  (Tuesday)  •  Transcriptome  reconstruc1on  using  reference  (Tuesday)  •  Transcriptome  reconstruc1on  without  reference  (Tuesday)  •  QC  analysis  (Wednesday)  •  Differen1al  expression  analysis  (Wednesday)  •  Gene  set  analysis  (Wednesday)  •  miRNA  analysis  (Thursday)  •  Allele  specific  analysis  (Thursday)  •  Single  cell  analysis    (Thursday)  

Page 18: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

How  are  RNA-­‐seq  data  generated?  

Sampling  process  

Page 19: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

Depending  on  the  different  steps  you  will  get  different  results  

AAAAAAAA  

enrichments  -­‐>  

reads  -­‐>  

library  -­‐>  

RNA-­‐>   PolyA                              (mRNA)  RiboMinus            (-­‐  rRNA)  Size    <50  nt          (miRNA  )  …..    

Size  of  fragment  Strand  specific  5’  end  specific    3’  end  specific  …..    

Single  end  (1  read  per  fragment)  Paired  end  (2  reads  per  fragment)  

Page 20: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and
Page 21: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

Promises  and  pi7alls  

Long  reads  •  Low  throughput              (-­‐)  •  Complete  transcripts  (+)  •  Only  highly  expressed  

genes                  (-­‐-­‐)  •  Expensive          (-­‐)  •  Low  background  noise  (+)  •  Easy  downstream  analysis  

(+)    

   

Micro  Arrays  •  High  throughput        (+)  •  Only  known  sequences    (-­‐)  •  Limited  dynamic  range      (-­‐)  •  Cheap                                      (+)  •  High  background  noise    (-­‐)  •  Not  strand  specific      (-­‐)  •  Well  established  downstream  

methods        (+)      

short  reads  •  High  throughput        (+)  •  Frac1ons  of  transcripts        (-­‐)  •  Full  dynamic  range      (+-­‐)  •  Unlimited  dynamic  range                (+)  •  Cheap                      (+)  •  Low  background  noise    (+)  •  Strand  specificity      (+)  •  Re-­‐sequencing      (+)  

 

1  

10  

100  

1000  

10000  

1   10   100   1000   10000   100000   1000000  

Signal  

#  trancripts/cell  

EST  

MicroArray  

RNAseq  

Page 22: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

RNA  seq  reads  correspond  directly  to  abundance  of  RNAs  in  the  sample  

Page 23: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

Map  reads  to  reference  

Page 24: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

     

Transcriptome  assembly  using  reference    

Page 25: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

Transcriptome  assembly  without  reference    

Page 26: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and
Page 27: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

Quality  control  -­‐samples  might  not  be  what  you  think  they  are  •  Experiments  go  wrong  –  30  samples  with  5  steps  from  samples  to  reads  has  150  poten1al  steps  for  errors  

–  Error  rate  1/100  with  5  steps  suggest  that  one  of  every  20  samples  the  reads  does  not  represent  the  sample    

•  Mixing  samples  –  30  samples  with  5  steps  from  samples  to  reads  has  ~24M  poten1al  mix  ups  of  samples    

–  Error  rate  1/  100  with  5  steps  suggest  that  one  of  every  20  sample  is  mislabeled    

•  Combine  the  two  steps  and  approximately  one  of  every  10  samples  are  wrong  

Page 28: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

RNA  QC  Read  quality  

Transcript  quality  

Mapping  sta1s1cs  

Compare  between  samples  

Page 29: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

Differen1al  expression  analysis  using  univariate  analysis  

Typically  univariate  analysis  (one  gene  at  a  1me)  –  even  though  we  know  that  genes  are  not  independent  

Page 30: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

Gene  set  analysis  and  data  integra1on    

26

RNase Y and virulence

- 38% of total virulence factors are affected by RNase Y deletion (88% downregulated) - 18% cleaved by RNase Y

with the help of Cristina Della Beffa

¾ SCP hidden by gene downregulation ?

Page 31: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

microRNA  analysis  (Johan)  

(Berezikov  et  al.  Genome  Research,  2011.)  

Page 32: RNA$seq(Introduc1on( - GitHub Pages · intergenic seqfrags in RNA-Seq data are near known genes, consistent with alternative cleavage and polyadenylation site usage, promoter- and

Single  cell  RNA-­‐seq  analysis  

(Sandberg,  Nature  Methods  2014)