REASSEMBLING 600+ MARINE TRANSCRIPTOMES: AUTOMATED PIPELINE DEVELOPMENT AND EVALUATION
Lisa Cohen, Harriet Alexander, C. Titus Brown
Lab for Data Intensive Biology (DIB), UC Davis
ASLO Aquatic Sciences Meeting, Session 016: Advances in Aquatic Meta-Omics
March 3, 2017
Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP)
- 678 Illumina RNA sequence datasets = 1 TB raw data
- Wide diversity spanning more than 40 phyla
- Original assemblies by the National Center for Genome Resources (NCGR)
Keeling et al. 2014, PMID: 24959919
Caron et al. 2016, PMID: 27867198
Need for a modularized, extensible RNA-seq pipeline:
o Software and best practices for RNA-seq analysis changing rapidly (Conesa et al. 2016, PMID: 26813401)
o Accumulating more and more data!
o MMETSP: awesome data set to test software and pipelines!
o What to do if:
  • New samples to add?
  • New software tool is developed?
Pipeline steps: metadata from NCBI (PRJNA231566) → download data → trim, fastqc → diginorm → Trinity assembly (Trinity.fasta) → evaluation, annotation, expression quantification
Adapted from the Brown lab, “Eel Pond mRNA-seq Protocol” (Titus Brown, Camille Scott, and Leigh Sheneman): http://eel-pond.readthedocs.io/en/latest/
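To make the "modularized, extensible" idea concrete, here is a minimal sketch, not the actual dib-MMETSP code. It assumes each step can be described by the file it produces, so that adding a sample or swapping in a new tool only reruns what is missing. The `Step` class, the `placeholder` helper, and all file names except `Trinity.fasta` are hypothetical; real steps would call the trimming, normalization, and assembly tools instead of writing stand-in files.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Step:
    name: str                     # label for reporting
    output: str                   # file the step is expected to create
    run: Callable[[Path], None]   # does the work for one sample directory


def placeholder(output: str) -> Callable[[Path], None]:
    """Hypothetical stand-in: creates the output file instead of running a tool."""
    def _run(sample_dir: Path) -> None:
        (sample_dir / output).write_text(f"placeholder for {output}\n")
    return _run


def run_pipeline(sample_dir: Path, steps: List[Step]) -> List[str]:
    """Run steps in order, skipping any step whose output already exists."""
    sample_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for step in steps:
        if (sample_dir / step.output).exists():
            continue  # already done for this sample
        step.run(sample_dir)
        executed.append(step.name)
    return executed


# Step order follows the slide; file names other than Trinity.fasta are made up.
STEPS = [
    Step("download", "reads.fastq", placeholder("reads.fastq")),
    Step("trim", "trimmed.fastq", placeholder("trimmed.fastq")),
    Step("diginorm", "normalized.fastq", placeholder("normalized.fastq")),
    Step("assemble", "Trinity.fasta", placeholder("Trinity.fasta")),
]
```

Calling `run_pipeline(Path("some_sample"), STEPS)` a second time does nothing, and appending a new `Step` to the list runs only that step for samples that are already assembled.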
Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2017): Marine Microbial Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare. https://doi.org/10.6084/m9.figshare.3840153.v6
1 TB raw storage, >8,000 computing hours
Our re-assemblies have more contigs:
[Figure: number of contigs, NCGR vs. DIB. Legend: # higher in NCGR, # higher in DIB. Values shown: 17, 610, 48,005, 25,059.]
Questions:
1. Did we generate more biologically-meaningful content with re-assemblies?
2. Are there phylogenetic patterns in the assemblies?
Smith-Unna et al. 2016, PMID: 27252236
Transrate score = overall quality of the final assembly (scale 0-1.0)
Qualities of our re-assemblies are higher:
[Figure: Transrate score, NCGR vs. DIB. Values shown: 0.31, 0.22.]
Re-assemblies generally contain most of the information in the NCGR assemblies, plus ~30% more content:
[Figure: proportion of contigs (CRB-BLAST) for DIB and NCGR. Panels: "Comparison: DIB vs. NCGR" and "Comparison: NCGR vs. DIB".]
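As a rough illustration of how a "proportion of contigs" shared between two assemblies can be computed, the sketch below uses plain reciprocal best hits. This is a simplification and an assumption on my part: CRB-BLAST goes further by also accepting hits that pass a fitted e-value cutoff, which is not attempted here. The hit tuples and function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, str, float]  # (query contig, target contig, bit score)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Keep the highest-scoring target for each query contig."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_fraction(a_vs_b: Iterable[Hit],
                        b_vs_a: Iterable[Hit],
                        n_contigs_a: int) -> float:
    """Fraction of contigs in assembly A whose best hit in B points back to them."""
    a_best = best_hits(a_vs_b)
    b_best = best_hits(b_vs_a)
    reciprocal = sum(1 for q, t in a_best.items() if b_best.get(t) == q)
    return reciprocal / n_contigs_a
```

Running it in both directions (A against B, then B against A, each divided by its own contig count) gives the two comparisons named in the figure panels.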
Similar Open Reading Frame (ORF) and Benchmarking Universal Single-Copy Orthologs (BUSCO) content:
[Figures: mean ORF percentage and complete BUSCO percentage, NCGR vs. DIB.]
Scott, C. in prep. 2016. www.camillescott.org/dammit
‘dammit’ annotation pipeline: Pfam, Rfam, OrthoDB
[Figure: # transcripts per MMETSP sample (sorted). Legend: annotated absent transcripts; transcripts absent from NCGR.]
After annotation, ~30% extra content appears real
[Figure labels: DIB, NCGR, extra content.]
Some DIB assemblies have more unique content.
Unique k-mers (k=25) = unique word combinations
1. Did we generate more biologically-meaningful content with re-assemblies? Probably.
[Figure: unique k-mers (DIB) vs. unique k-mers (NCGR).]
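For readers unfamiliar with k-mers, a minimal sketch of counting the distinct 25-letter words in an assembly follows. It is an exact, in-memory version for illustration only; it does not merge reverse complements or handle ambiguous bases, and it is not necessarily how the numbers in the talk were produced. The function names are hypothetical.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int = 25) -> Set[str]:
    """All distinct substrings of length k in one contig."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(contigs: Iterable[str], k: int = 25) -> Set[str]:
    """Distinct k-mers across every contig of an assembly."""
    found: Set[str] = set()
    for contig in contigs:
        found |= kmers(contig, k)
    return found
```

With two assemblies loaded as lists of contig strings, `len(unique_kmers(dib))` and `len(unique_kmers(ncgr))` give the two axes of the comparison, and the set difference `unique_kmers(dib) - unique_kmers(ncgr)` is the content found only in the re-assembly.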
Assemblies from Dinophyta have more unique k-mers and lower qualities.
Dinoflagellates: steady-state gene expression, translational gene regulation
Aranda et al. 2016, PMID: 28004835
Lin 2011, PMID: 21514379
Hou and Lin 2009, PMID: 27426948
[Figure: results by phylum; sample sizes as extracted: N=173111736160602522]
2. Can we detect phylogenetic differences in the assemblies?
Unique k-mers = unique word combinations (k=25)
Ciliophora have lower ORF percentages
[Figure: results by phylum; sample sizes as extracted: N=173111736160602522]
Ciliates: alternative triplet codon dictionary, STOP codons serve a different purpose
Alkalaeva and Mikhailova 2016, PMID: 28009453
Heaphy et al. 2016, PMID: 27501944
Swart et al. 2016, PMID: 27426948
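The effect of a different codon dictionary on ORF percentage can be sketched in a few lines. This assumes a simplified definition (the longest stop-free run of codons on the forward strand, as a percentage of contig length), which is not necessarily how the evaluation software computes it. The ciliate stop set follows NCBI translation table 6, where TAA and TAG encode glutamine; not every ciliate uses that table.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})
CILIATE_STOPS = frozenset({"TGA"})  # NCBI table 6: TAA and TAG code for glutamine


def longest_orf_percent(seq: str, stops: frozenset = STANDARD_STOPS) -> float:
    """Percent of the contig covered by its longest stop-free codon run.

    Simplified: forward strand only, three frames, no start codon required.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    longest = 0
    for frame in range(3):
        run = 0
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in stops:
                run = 0          # a stop codon ends the current open stretch
            else:
                run += 3
                longest = max(longest, run)
    return 100.0 * longest / len(seq)
```

For a contig such as `CAGTAACTGACTGAC`, the in-frame TAA cuts the open stretch short under the standard stop set, while `longest_orf_percent(seq, CILIATE_STOPS)` reads straight through it. Scoring ciliate contigs with the standard dictionary would therefore be expected to lower their ORF percentages.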
2. Are there phylogenetic differences in the assemblies? Trends.
[Figure axes: mean % ORF; # contigs.]
Future work:
• In-depth annotation analysis
• Orthologous groupings of contigs
• Co-expression network analysis
Conclusions
• Better reference transcriptomes for MMETSP available: https://doi.org/10.6084/m9.figshare.3840153.v6
• Strain-specific trends in assemblies support previously-reported transcriptomic features
• De novo transcriptome assembly pipeline available: https://github.com/dib-lab/dib-MMETSP
Contact:
Acknowledgements
• Data Intensive Biology Lab
  – Camille Scott, Luiz Irber
• MSU iCER
• NSF’s XSEDE, Jetstream cloud
• Substituting for my NPB101D sections today:
  – Natalia Caporale, Sheryar Siddiqui, Pearl Chen, Arik Davidyan, Karl Larson
Photo by James Word
Data Intensive Biology Summer Institute, applications due March 17th! http://ivory.idyll.org/dibsi/
Files available for download!
Cohen, Lisa; Alexander, Harriet; Brown, C. Titus (2017): Marine Microbial Eukaryotic Transcriptome Sequencing Project, re-assemblies. figshare. https://doi.org/10.6084/m9.figshare.3840153.v6
https://github.com/dib-lab/dib-MMETSP