De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis

BRIAN J HAAS, ALEXIE PAPANICOLAOU, MORAN YASSOUR, MANFRED GRABHERR, PHILIP D BLOOD, JOSHUA BOWDEN, MATTHEW BRIAN COUGER, DAVID ECCLES, BO LI, MATTHIAS LIEBER, MATTHEW D MACMANES, MICHAEL OTT, JOSHUA ORVIS, NATHALIE POCHET, FRANCESCO STROZZI, NATHAN WEEKS, RICK WESTERMAN, THOMAS WILLIAM, COLIN N DEWEY, ROBERT HENSCHEL, RICHARD D LEDUC, NIR FRIEDMAN & AVIV REGEV

NATURE PROTOCOLS 8, 2013

Anti Alman, 22.05.2014


Page 2: De novo transcript sequence reconstruction from RNA-seq ...

Introduction

Platform for de novo transcriptome assembly

◦ From RNA-seq data (Illumina only)

◦ Mainly for non-model organisms

◦ Fully reconstructs a large fraction of the transcripts present in the data

◦ Including alternative splice isoforms and transcripts from recently duplicated genes (with some caveats)

Page 3: De novo transcript sequence reconstruction from RNA-seq ...

Introduction II

Original methodology published in 2011

Used in many different research projects

◦ Genome sequence of foxtail millet (Setaria italica) provides insights into grass evolution and biofuel potential

◦ The African coelacanth genome provides insights into tetrapod evolution

Significantly improved since 2011

◦ memory requirements halved

◦ increased performance through parallelization

◦ seamlessly uses various third-party tools

Page 4: De novo transcript sequence reconstruction from RNA-seq ...

Trinity de novo assembly

Three consecutive modules

◦ Inchworm

◦ Chrysalis

◦ Butterfly

Page 5: De novo transcript sequence reconstruction from RNA-seq ...

Inchworm

Inchworm assembles the read data set by greedily searching for paths in a k-mer graph, resulting in a collection of linear contigs with each k-mer present only once in the contigs.

GRABHERR, M.G. ET AL. NAT. BIOTECHNOL. 29, 644–652 (2011)

Page 6: De novo transcript sequence reconstruction from RNA-seq ...

Inchworm

GRABHERR, M.G. ET AL. NAT. BIOTECHNOL. 29, 644–652 (2011)

Constructs a k-mer dictionary from all sequence reads

Selects the most frequent k-mer in the dictionary (seed)

Extends the seed in each direction by finding the most frequent k-mer with a k-1 overlap

Extends the sequence in either direction until it cannot be extended further, then reports the linear contig
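
The four steps above can be sketched in a few lines of Python. This is only an illustration of the greedy idea, not Inchworm's actual implementation: it ignores reverse complements, sequencing-error filtering and tie-breaking rules, and the function names (`count_kmers`, `extend`, `inchworm_contigs`) are made up for this example. It assumes k is at least 2.

```python
from collections import Counter


def count_kmers(reads, k):
    """Build the k-mer dictionary: k-mer -> number of occurrences in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend(contig, counts, k, forward):
    """Greedily extend a contig in one direction, one base at a time."""
    while True:
        if forward:
            overlap = contig[-(k - 1):]
            candidates = [overlap + base for base in "ACGT"]
        else:
            overlap = contig[:k - 1]
            candidates = [base + overlap for base in "ACGT"]
        # Keep only k-mers that are still available in the dictionary.
        candidates = [c for c in candidates if counts.get(c, 0) > 0]
        if not candidates:
            return contig
        best = max(candidates, key=lambda c: counts[c])
        del counts[best]  # each k-mer may be used only once
        contig = contig + best[-1] if forward else best[0] + contig


def inchworm_contigs(reads, k):
    """Report linear contigs until the k-mer dictionary is exhausted."""
    counts = count_kmers(reads, k)
    contigs = []
    while counts:
        seed, _ = counts.most_common(1)[0]  # most frequent remaining k-mer
        del counts[seed]
        contig = extend(seed, counts, k, forward=True)
        contig = extend(contig, counts, k, forward=False)
        contigs.append(contig)
    return contigs


reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
print(inchworm_contigs(reads, k=4))
```

With these three overlapping toy reads the sketch reports a single contig that covers all of them.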

Page 7: De novo transcript sequence reconstruction from RNA-seq ...

Inchworm

contiguous (fused) transcripts

Page 8: De novo transcript sequence reconstruction from RNA-seq ...

Chrysalis

Chrysalis pools (clusters) contigs into components

◦ If they have at least k-1 overlap

◦ If enough reads span the join

An individual de Bruijn graph is built from each pool

GRABHERR, M.G. ET AL. NAT. BIOTECHNOL. 29, 644–652 (2011)

Page 9: De novo transcript sequence reconstruction from RNA-seq ...

de Bruijn graph

Every edge is a k-mer

Every node is a k-1 overlap

HTTP://GCAT.DAVIDSON.EDU/PHAST/DEBRUIJN.HTML
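
A minimal sketch of this definition in Python: each k-mer in the reads becomes an edge from its first k-1 bases to its last k-1 bases. The function name `de_bruijn` is an assumption for this example, and counting how often each k-mer occurs is used here only as a simple stand-in for edge support.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Return {(prefix, suffix): count}: one edge per k-mer, nodes are (k-1)-mers."""
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # The k-mer connects its (k-1)-mer prefix to its (k-1)-mer suffix.
            edges[(kmer[:-1], kmer[1:])] += 1
    return dict(edges)


for (src, dst), count in de_bruijn(["ACGTC", "CGTCA"], k=3).items():
    print(src, "->", dst, "seen", count, "time(s)")
```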

Page 10: De novo transcript sequence reconstruction from RNA-seq ...

Chrysalis I

It recursively groups Inchworm contigs into connected components.

◦ If there is a perfect overlap of k-1 bases

◦ If there is a minimal number of reads that span the junction across both contigs

◦ Each such read must match (k-1)/2 bases on each side of the (k-1)-mer junction.
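
The two conditions above can be sketched as follows. This is a simplified illustration, not Chrysalis itself: a "weld" here is the shared (k-1)-mer plus (k-1)/2 flanking bases taken from each contig, a read supports the join if it contains that weld as an exact substring, and the threshold `min_reads` and all function names are assumptions made for this example.

```python
def kmer_positions(seq, size):
    """Map each word of the given size to its first position in seq."""
    positions = {}
    for i in range(len(seq) - size + 1):
        positions.setdefault(seq[i:i + size], i)
    return positions


def welds(a, b, k):
    """Yield junction sequences: left flank from a + shared (k-1)-mer + right flank from b."""
    size, half = k - 1, (k - 1) // 2
    pos_b = kmer_positions(b, size)
    for i in range(len(a) - size + 1):
        j = pos_b.get(a[i:i + size])
        # Skip if there is no shared (k-1)-mer or not enough flanking sequence.
        if j is None or i < half or j + size + half > len(b):
            continue
        yield a[i - half:i + size] + b[j + size:j + size + half]


def cluster_contigs(contigs, reads, k, min_reads=2):
    """Group contigs into components using a small union-find structure."""
    parent = list(range(len(contigs)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(contigs)):
        for j in range(len(contigs)):
            if i == j:
                continue
            for weld in welds(contigs[i], contigs[j], k):
                support = sum(weld in read for read in reads)
                if support >= min_reads:
                    parent[find(i)] = find(j)  # join the two components
                    break

    groups = {}
    for i, contig in enumerate(contigs):
        groups.setdefault(find(i), []).append(contig)
    return list(groups.values())


contigs = ["AAACCGGT", "CGGTTTA", "GGGGGGGG"]
reads = ["AACCGGTTTA", "ACCGGTTT"]
print(cluster_contigs(contigs, reads, k=5))
```

In this toy input the first two contigs share the 4-mer CGGT and two reads span the junction, so they end up in one component while the third contig stays alone.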

GRABHERR, M.G. ET AL. NAT. BIOTECHNOL. 29, 644–652 (2011)

Page 11: De novo transcript sequence reconstruction from RNA-seq ...

Chrysalis II

It builds a de Bruijn graph for each component

◦ using a word size of k-1 to represent nodes

◦ k to define the edges connecting the nodes.

It weights each edge of the de Bruijn graph with the number of (k-1)-mers in the original read set that support it.

GRABHERR, M.G. ET AL. NAT. BIOTECHNOL. 29, 644–652 (2011)

Page 12: De novo transcript sequence reconstruction from RNA-seq ...

Chrysalis III

Each read is assigned to the component with which it shares the largest number of k-mers.

Determines the regions within each read that contribute k-mers to the component.
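
The assignment rule can be illustrated with plain set intersections. This sketch is an assumption-laden simplification: components are given as a dictionary from a made-up component name to its contigs, ties go to the first component encountered, and reads that share no k-mer with any component are left unassigned (`None`).

```python
def kmers(seq, k):
    """Return the set of k-mers contained in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_reads(reads, components, k):
    """Assign each read to the component sharing the most k-mers with it."""
    component_kmers = {
        name: set().union(*(kmers(contig, k) for contig in contigs))
        for name, contigs in components.items()
    }
    assignment = {}
    for read in reads:
        read_kmers = kmers(read, k)
        best, best_shared = None, 0
        for name, known in component_kmers.items():
            shared = len(read_kmers & known)
            if shared > best_shared:
                best, best_shared = name, shared
        assignment[read] = best
    return assignment


components = {"c1": ["ACGTACGGA"], "c2": ["TTTTGGGGCC"]}
print(assign_reads(["CGTACG", "TTGGGG", "AAAAAA"], components, k=4))
```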

GRABHERR, M.G. ET AL. NAT. BIOTECHNOL. 29, 644–652 (2011)

Page 13: De novo transcript sequence reconstruction from RNA-seq ...

Butterfly

Butterfly takes each de Bruijn graph from Chrysalis and trims spurious edges and compacts linear paths.

It then reconciles the graph with reads and pairs.

It outputs one linear sequence for each splice form and/or paralogous transcript reflected in the graph.

GRABHERR, M.G. ET AL. NAT. BIOTECHNOL. 29, 644–652 (2011)

Page 14: De novo transcript sequence reconstruction from RNA-seq ...

Butterfly

Butterfly iterates between

◦ merging consecutive nodes in linear paths

◦ pruning edges that represent minor deviations

Reads are typically much longer than k

◦ can resolve ambiguities

◦ reduce the combinatorial number of paths
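
The merging step can be illustrated with a generic compaction of non-branching paths in a de Bruijn graph. This is a standard textbook technique shown as a sketch, not Butterfly's own code: it does not prune edges or use read support, it skips isolated cycles, and the function name `compact_linear_paths` is made up for this example.

```python
from collections import defaultdict


def compact_linear_paths(edges):
    """Merge chains of (k-1)-mer nodes with one way in and one way out into sequences."""
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    nodes = set(succ) | set(pred)

    def is_inner(node):
        # An inner node has exactly one predecessor and one successor.
        return len(pred[node]) == 1 and len(succ[node]) == 1

    paths = []
    for start in sorted(nodes):
        if is_inner(start):
            continue  # paths begin only at branch points, sources or sinks
        for nxt in sorted(succ[start]):
            seq = start + nxt[-1]
            while is_inner(nxt):
                nxt = next(iter(succ[nxt]))
                seq += nxt[-1]
            paths.append(seq)
    return paths


# Two toy isoforms, ATGGCA and ATGGTC, share the prefix ATGG and then diverge.
edges = [("ATG", "TGG"), ("TGG", "GGC"), ("GGC", "GCA"),
         ("TGG", "GGT"), ("GGT", "GTC")]
print(compact_linear_paths(edges))
```

The shared prefix and the two diverging branches each collapse into one sequence, which is the kind of compact graph on which longer reads can then be used to choose between paths.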

GRABHERR, M.G. ET AL. NAT. BIOTECHNOL. 29, 644–652 (2011)

Page 15: De novo transcript sequence reconstruction from RNA-seq ...

Butterfly

Alternatively spliced transcripts

Page 16: De novo transcript sequence reconstruction from RNA-seq ...

Transcript reconstruction

S. pombe

Oracle

◦ Empirical upper limit based on reads and known protein-coding sequences

GRABHERR, M.G. ET AL. NAT. BIOTECHNOL. 29, 644–652 (2011)

Page 17: De novo transcript sequence reconstruction from RNA-seq ...

Expression profiles reference vs Trinity

Page 18: De novo transcript sequence reconstruction from RNA-seq ...

Protocol example

Schizosaccharomyces pombe grown in four conditions

◦ 4 million paired-end reads

◦ Requires 8GB RAM (1GB per million)

◦ Takes approximately 4 h

Main steps

◦ Collection of RNA-seq data (10 min)

◦ De novo RNA-seq assembly using Trinity (60-90 min)

◦ Quality assessment (90 min)

◦ Abundance estimation using RSEM (40-60 min)

◦ Differential expression analysis using edgeR (<5 min)

Page 19: De novo transcript sequence reconstruction from RNA-seq ...

Alternatives

Velvet – de Bruijn

ABySS – de Bruijn

MIRA – overlap graph

Oases – de Bruijn

◦ Based on Velvet

Page 20: De novo transcript sequence reconstruction from RNA-seq ...

Comparison

SCIENCE CHINA FEBRUARY 2013 VOL.56 NO.2: 156–162

On randomly generated short reads from chromosome 22

Page 21: De novo transcript sequence reconstruction from RNA-seq ...

Comparison II

SCIENCE CHINA FEBRUARY 2013 VOL.56 NO.2: 156–162

10 highest concentration RNAs in the ERCC mix

Page 22: De novo transcript sequence reconstruction from RNA-seq ...

After de novo RNA-seq assembly

Relies on third-party tools

Transcriptome analysis package for non-model organisms

◦ Comparing transcriptomes across samples

◦ Transcript abundance estimation

◦ Analysis of differentially expressed transcripts

◦ Protein-coding region prediction and functional annotation of Trinity transcripts

Page 23: De novo transcript sequence reconstruction from RNA-seq ...

Comparing transcriptomes across samples

Combine all reads across all samples into a single RNA-seq data set

Generate a single reference Trinity assembly

Align each sample's (non-normalized) reads to the Trinity assembly

Page 24: De novo transcript sequence reconstruction from RNA-seq ...

Transcript abundance estimation

Re-align reads to the assembled transcripts

◦ Alternatively spliced isoforms and recently duplicated genes?

◦ RNA-seq by Expectation Maximization (RSEM)

◦ Requires gap-free alignments
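
To show why expectation maximization helps with reads that align to several isoforms, here is a deliberately simplified EM loop. It is not RSEM's model (which also accounts for transcript length, fragment sizes and sequencing errors); the function name `em_abundance`, the input layout and the iteration count are assumptions made for this example.

```python
def em_abundance(read_hits, transcripts, iterations=100):
    """Estimate relative abundances from reads that may hit several transcripts."""
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(iterations):
        # E-step: split each read among its transcripts in proportion to theta.
        counts = {t: 0.0 for t in transcripts}
        for hits in read_hits:
            total = sum(theta[t] for t in hits)
            if total == 0:
                continue
            for t in hits:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the fractional read counts.
        assigned = sum(counts.values())
        if assigned == 0:
            break
        theta = {t: c / assigned for t, c in counts.items()}
    return theta


# 6 reads unique to iso1, 2 unique to iso2, 4 compatible with both isoforms.
read_hits = [["iso1"]] * 6 + [["iso2"]] * 2 + [["iso1", "iso2"]] * 4
print(em_abundance(read_hits, ["iso1", "iso2"]))
```

The ambiguous reads end up being shared roughly 3:1, following the ratio suggested by the uniquely aligned reads, rather than being split evenly or discarded.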

edgeR - compare expression levels of different transcripts or genes across samples

Page 25: De novo transcript sequence reconstruction from RNA-seq ...

Analysis of differentially expressed transcripts

Relies on tools from Bioconductor

◦ edgeR

◦ DESeq

Easy-to-use Perl scripts

Page 26: De novo transcript sequence reconstruction from RNA-seq ...

Protein-coding region prediction and functional annotation of Trinity transcripts

TransDecoder identifies candidate protein-coding regions

◦ Based on nucleotide composition

◦ Open reading frame length

◦ Pfam domain content
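
Of the three criteria, open reading frame length is the easiest to illustrate. The sketch below simply scans the three forward reading frames for the longest ATG-to-stop stretch. It is not TransDecoder's algorithm: nucleotide composition scoring, Pfam searches and the reverse strand are left out, and the function name `longest_orf` is made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the longest ATG...stop open reading frame on the forward strand."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon in this frame opens the ORF
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None
    return best


print(longest_orf("CCATGAAATTTTAGGG"))
```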

Page 27: De novo transcript sequence reconstruction from RNA-seq ...

Limitations

Only for Illumina RNA-seq data

Difficult to fully understand the structural basis for the observed transcript variations

Sequence variations that cannot be properly phased can result in erroneous chimeras between isoforms.

Incorrect transcript assembly or isoform misalignment can be easily misinterpreted as evidence of polymorphism

Page 28: De novo transcript sequence reconstruction from RNA-seq ...

Thank you for your attention!

Trinity is installed on alligaator.at.mt.ut.ee

◦ /usr/local/trinityrnaseq_r20140413p1

Questions:

1. What is the difference between a model organism and a non-model organism?

2. Why do we need de novo transcriptome assembly and what makes it difficult?