REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Lisa Cohen, Harriet Alexander, C. Titus Brown
Lab for Data Intensive Biology (DIB), UC Davis
ASLO Aquatic Sciences meeting, Session 016: Advances in Aquatic Meta-Omics
March 3, 2017
@monsterbashseq ljcohen@ucdavis.edu


Page 1: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Lisa Cohen, Harriet Alexander, C. Titus Brown
Lab for Data Intensive Biology (DIB), UC Davis

ASLO Aquatic Sciences meeting
Session 016: Advances in Aquatic Meta-Omics

March 3, 2017

@monsterbashseq ljcohen@ucdavis.edu

Page 2: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP)

- 678 Illumina RNA sequence datasets = 1 TB raw data
- Wide diversity spanning more than 40 phyla
- Original assemblies by the National Center for Genome Resources (NCGR)

Keeling et al. 2014 PMID: 24959919

Caron et al. 2016 PMID: 27867198

Page 3: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Need for a modularized, extensible RNA-seq pipeline:
o Software and best practices for RNA-seq analysis are changing rapidly (Conesa et al. 2016, PMID: 26813401)
o Accumulating more and more data!
o MMETSP: awesome data set to test software and pipelines!
o What to do if:
  • New samples to add?
  • New software tool is developed?

Page 4: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Workflow:
1. Metadata from NCBI (PRJNA231566)
2. download data
3. trim, fastqc
4. diginorm
5. Trinity assembly (Trinity.fasta)
6. evaluation, annotation, expression quantification

Adapted from the Brown lab, “Eel Pond mRNA-seq Protocol”: http://eel-pond.readthedocs.io/en/latest/
Titus Brown, Camille Scott, and Leigh Sheneman

Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2017): Marine Microbial Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare. https://doi.org/10.6084/m9.figshare.3840153.v6

1 TB raw storage, >8,000 computing hours
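One way to picture a modularized, re-runnable pipeline is sketched below. This is not the dib-MMETSP code: the `Step` class, the `placeholder` helper, and every output file name except Trinity.fasta are hypothetical, and each stage only writes a marker file where a real tool would be called. The stage order follows the workflow on this slide.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Step:
    """One pipeline stage: a name, the file it produces, and how to make it."""
    name: str
    output: str                        # hypothetical output file name
    run: Callable[[Path, Path], None]  # run(sample_dir, output_path)


def run_pipeline(sample_dir: Path, steps: List[Step]) -> List[str]:
    """Run steps in order, skipping any whose output already exists.

    Returns the names of the steps that actually ran.
    """
    executed = []
    for step in steps:
        out = sample_dir / step.output
        if out.exists():
            continue  # already done for this sample
        step.run(sample_dir, out)
        executed.append(step.name)
    return executed


def placeholder(label: str) -> Callable[[Path, Path], None]:
    """Stand-in for a real tool invocation; it only writes a marker file."""
    def _run(sample_dir: Path, out: Path) -> None:
        out.write_text(f"{label} finished for {sample_dir.name}\n")
    return _run


# Illustrative stage list; swapping a tool means swapping one Step.
STEPS = [
    Step("download", "reads.fastq", placeholder("download")),
    Step("trim", "reads.trimmed.fastq", placeholder("trim")),
    Step("diginorm", "reads.diginorm.fastq", placeholder("diginorm")),
    Step("assemble", "Trinity.fasta", placeholder("Trinity assembly")),
    Step("evaluate", "evaluation.txt", placeholder("evaluation")),
]
```

With this layout, a new sample is just another directory passed to `run_pipeline`, and a new software tool replaces one entry in `STEPS`. The skip rule is deliberately naive: it does not notice when an upstream output has changed, which a production pipeline would need to handle.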

Page 5: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Our re-assemblies have more contigs:

[Figure: number of contigs per assembly, NCGR vs. DIB. Annotated values: 48,005 and 25,059. Legend: # higher in NCGR, # higher in DIB; counts shown: 17 and 610.]
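The "# higher in NCGR" and "# higher in DIB" labels suggest a per-sample tally of which assembly has the larger value. A minimal sketch of such a tally follows; the function name and the dictionary layout (sample ID mapped to contig count) are assumptions for illustration, not taken from the talk.

```python
from typing import Dict


def tally_higher(ncgr: Dict[str, int], dib: Dict[str, int]) -> Dict[str, int]:
    """Count, over samples present in both inputs, which assembly has more contigs."""
    shared = ncgr.keys() & dib.keys()  # only compare samples assembled by both
    return {
        "higher_in_NCGR": sum(1 for s in shared if ncgr[s] > dib[s]),
        "higher_in_DIB": sum(1 for s in shared if dib[s] > ncgr[s]),
        "tied": sum(1 for s in shared if dib[s] == ncgr[s]),
    }
```

The same tally works for any per-sample metric, such as a quality score, as long as both assemblies are keyed by the same sample IDs.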

Page 6: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Questions:

1. Did we generate more biologically-meaningful content with re-assemblies?

2. Are there phylogenetic patterns in the assemblies?

Page 7: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Smith-Unna et al. 2016 PMID: 27252236

Transrate score = overall quality of the final assembly (scale 0-1.0)

Qualities of our re-assemblies are higher:

1. Did we generate more biologically-meaningful content with re-assemblies?

[Figure: Transrate score per assembly, NCGR vs. DIB. Annotated values: 0.31 and 0.22.]

Page 8: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Re-assemblies generally contain most of the information in the NCGR assemblies, plus ~30% more content:

1. Did we generate more biologically-meaningful content with re-assemblies?

[Figure: proportion of contigs (CRB-BLAST), two panels: "Comparison: DIB vs. NCGR" and "Comparison: NCGR vs. DIB". Groups: NCGR, DIB.]
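CRB-BLAST (conditional reciprocal best BLAST) builds on reciprocal best hits between two sequence sets. The sketch below covers only the plain reciprocal-best step and the resulting proportion of contigs with a match; it omits the "conditional" stage, which recovers additional hits using a fitted e-value cutoff. The tuple layout `(query_id, target_id, bitscore)` and all function names are assumptions for illustration.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float]  # (query_id, target_id, bitscore)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Keep the highest-scoring target for each query."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit], b_vs_a: Iterable[Hit]) -> Dict[str, str]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}


def proportion_with_match(contig_ids: Iterable[str], rbh: Dict[str, str]) -> float:
    """Fraction of contigs in one assembly that have a reciprocal best hit."""
    ids = list(contig_ids)
    if not ids:
        return 0.0
    return sum(1 for contig in ids if contig in rbh) / len(ids)
```

Running the comparison in both directions (DIB contigs as the denominator, then NCGR contigs as the denominator) gives the two proportions that the panels contrast.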

Page 9: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Similar Open Reading Frame (ORF) and Benchmarking Universal Single-Copy Orthologs (BUSCO) percentages

1. Did we generate more biologically-meaningful content with re-assemblies?

[Figure: two panels, NCGR vs. DIB: mean ORF percentage; complete BUSCO percentage.]

Page 10: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Scott, C. in prep. 2016. www.camillescott.org/dammit

‘dammit’ annotation pipeline: Pfam, Rfam, OrthoDB

[Figure: # transcripts per MMETSP sample (sorted). Legend: transcripts absent from NCGR; annotated absent transcripts.]

1. Did we generate more biologically-meaningful content with re-assemblies?

After annotation, ~30% extra content appears real

[Diagram labels: DIB, NCGR, extra content]

Page 11: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Some DIB assemblies have more unique content.

Unique k-mers (k=25) = unique word combinations

1. Did we generate more biologically-meaningful content with re-assemblies? Probably.

[Figure axes: unique k-mers (DIB), unique k-mers (NCGR)]
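To make "unique k-mers" concrete, here is a minimal sketch that collects the distinct length-k substrings across a set of contigs. It is exact and strand-specific, and it skips windows containing anything other than A, C, G, or T; those choices are assumptions for illustration, and the talk does not say which tool or settings produced its counts.

```python
from typing import Iterable, Set

DNA = frozenset("ACGT")


def unique_kmers(sequences: Iterable[str], k: int = 25) -> Set[str]:
    """Return the set of distinct k-mers found in the given sequences."""
    kmers: Set[str] = set()
    for seq in sequences:
        seq = seq.upper()
        # A sequence shorter than k yields an empty range and contributes nothing.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if DNA.issuperset(kmer):  # skip windows with N or other ambiguity codes
                kmers.add(kmer)
    return kmers
```

Comparing `len(unique_kmers(dib_contigs))` with `len(unique_kmers(ncgr_contigs))` for one sample gives one point in a DIB-versus-NCGR comparison. Holding every k-mer in memory does not scale to hundreds of assemblies, so a real analysis would likely use an approximate counter instead.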

Page 12: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Assemblies from Dinophyta have more unique k-mers and lower qualities.

Dinoflagellates: steady-state gene expression, translational gene regulation
Aranda et al. 2016 PMID: 28004835
Lin 2011 PMID: 21514379
Hou and Lin 2009 PMID: 27426948

N=173111736160602522

2. Can we detect phylogenetic differences in the assemblies?

Unique k-mers = unique word combinations (k=25)

Page 13: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Ciliophora have lower ORF percentages

N=173111736160602522

Ciliates: alternative triplet codon dictionary; STOP codons serve a different purpose
Alkalaeva and Mikhailova 2016, PMID: 28009453
Heaphy et al. 2016, PMID: 27501944
Swart et al. 2016, PMID: 27426948

2. Are there phylogenetic differences in the assemblies? Trends.

[Figure axes: mean % ORF, # contigs]
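A small sketch can show why an alternative codon dictionary lowers apparent ORF content when the standard code is assumed. Under the ciliate nuclear code (NCBI translation table 6), TAA and TAG encode glutamine and TGA is the only stop. The function below counts the longest stop-free run of codons in a single forward frame; it is a simplification for illustration, not the ORF-percentage metric used in the talk, whose definition is not given on the slide.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
CILIATE_STOPS = frozenset({"TGA"})  # NCBI table 6: TAA and TAG encode Gln


def stop_free_codons(seq: str, stops: frozenset, frame: int = 0) -> int:
    """Length, in codons, of the longest run without a stop codon in one frame."""
    seq = seq.upper()
    best = run = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            run = 0  # a stop codon ends the current open stretch
        else:
            run += 1
            best = max(best, run)
    return best


# Toy sequence (made up): ATG GCA TAA GCA TAG GCA TGA
toy = "ATGGCATAAGCATAGGCATGA"
standard_len = stop_free_codons(toy, STANDARD_STOPS)  # TAA and TAG break the frame
ciliate_len = stop_free_codons(toy, CILIATE_STOPS)    # only TGA ends the frame
```

Read with the standard code, the toy sequence is split into short fragments; read with the ciliate code, the same frame stays open until TGA. A fuller version would scan all six frames and require a start codon.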

Page 14: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Conclusions
• Better reference transcriptomes for MMETSP available: https://doi.org/10.6084/m9.figshare.3840153.v6
• Strain-specific trends in assemblies support previously-reported transcriptomic features
• De novo transcriptome assembly pipeline available: https://github.com/dib-lab/dib-MMETSP

Future work:
• In-depth annotation analysis
• Orthologous groupings of contigs
• Co-expression network analysis

Contact: @monsterbashseq ljcohen@ucdavis.edu

Page 15: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Acknowledgements
• Data Intensive Biology Lab
  – Camille Scott, Luiz Irber
• MSU iCER
• NSF’s XSEDE, Jetstream cloud
• Substituting for my NPB101D sections today:
  – Natalia Caporale, Sheryar Siddiqui, Pearl Chen, Arik Davidyan, Karl Larson

Photo by James Word

Data Intensive Biology Summer Institute, applications due March 17th! http://ivory.idyll.org/dibsi/

Page 16: REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION

Files available for download!

Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2017): Marine Microbial Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare. https://doi.org/10.6084/m9.figshare.3840153.v6

https://github.com/dib-lab/dib-MMETSP

@monsterbashseq ljcohen@ucdavis.edu
